Bishop C. M., Pattern Recognition and Machine Learning (2006), Chapter 3: Linear Models for Regression (excerpt)
The simplest form of linear regression models are also linear functions of the input variables. However, we can obtain a much more useful class of functions by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions. Such models are linear functions of the parameters, which gives them simple analytical properties, and yet can be nonlinear with respect to the input variables.

Given a training data set comprising N observations {x_n}, where n = 1, ..., N, together with corresponding target values {t_n}, the goal is to predict the value of t for a new value of x. In the simplest approach, this can be done by directly constructing an appropriate function y(x) whose values for new inputs x constitute the predictions for the corresponding values of t. More generally, from a probabilistic perspective, we aim to model the predictive distribution p(t|x) because this expresses our uncertainty about the value of t for each value of x. From this conditional distribution we can make predictions of t, for any new value of x, in such a way as to minimize the expected value of a suitably chosen loss function.
As discussed in Section 1.5.5, a common choice of loss function for real-valued variables is the squared loss, for which the optimal solution is given by the conditional expectation of t. Although linear models have significant limitations as practical techniques for pattern recognition, particularly for problems involving input spaces of high dimensionality, they have nice analytical properties and form the foundation for more sophisticated models to be discussed in later chapters.

3.1. Linear Basis Function Models

The simplest linear model for regression is one that involves a linear combination of the input variables
\[ y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D \tag{3.1} \]
where x = (x_1, ..., x_D)^T. This is often simply known as linear regression. The key property of this model is that it is a linear function of the parameters w_0, ..., w_D. It is also, however, a linear function of the input variables x_i, and this imposes significant limitations on the model. We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form
\[ y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) \tag{3.2} \]
where φ_j(x) are known as basis functions. By denoting the maximum value of the index j by M − 1, the total number of parameters in this model will be M.

The parameter w_0 allows for any fixed offset in the data and is sometimes called a bias parameter (not to be confused with ‘bias’ in a statistical sense).
It is often convenient to define an additional dummy ‘basis function’ φ_0(x) = 1 so that
\[ y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{3.3} \]
where w = (w_0, ..., w_{M−1})^T and φ = (φ_0, ..., φ_{M−1})^T. In many practical applications of pattern recognition, we will apply some form of fixed pre-processing, or feature extraction, to the original data variables. If the original variables comprise the vector x, then the features can be expressed in terms of the basis functions {φ_j(x)}.

By using nonlinear basis functions, we allow the function y(x, w) to be a nonlinear function of the input vector x. Functions of the form (3.2) are called linear models, however, because this function is linear in w. It is this linearity in the parameters that will greatly simplify the analysis of this class of models. However, it also leads to some significant limitations, as we discuss in Section 3.6.

The example of polynomial regression considered in Chapter 1 is a particular example of this model in which there is a single input variable x, and the basis functions take the form of powers of x so that φ_j(x) = x^j. One limitation of polynomial basis functions is that they are global functions of the input variable, so that changes in one region of input space affect all other regions. This can be resolved by dividing the input space up into regions and fitting a different polynomial in each region, leading to spline functions (Hastie et al., 2001).

There are many other possible choices for the basis functions, for example
\[ \phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2s^2} \right\} \tag{3.4} \]
where the μ_j govern the locations of the basis functions in input space, and the parameter s governs their spatial scale. These are usually referred to as ‘Gaussian’ basis functions, although it should be noted that they are not required to have a probabilistic interpretation, and in particular the normalization coefficient is unimportant because these basis functions will be multiplied by adaptive parameters w_j.

Another possibility is the sigmoidal basis function of the form
\[ \phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right) \tag{3.5} \]
where σ(a) is the logistic sigmoid function defined by
\[ \sigma(a) = \frac{1}{1 + \exp(-a)}. \tag{3.6} \]
Equivalently, we can use the ‘tanh’ function because this is related to the logistic sigmoid by tanh(a) = 2σ(a) − 1, and so a general linear combination of logistic sigmoid functions is equivalent to a general linear combination of ‘tanh’ functions. These various choices of basis function are illustrated in Figure 3.1.
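A minimal sketch, assuming NumPy, of how (3.2)-(3.6) translate into code: it builds the basis-function vector φ(x) for polynomial, Gaussian, and sigmoidal choices (the centres μ_j, the scale s, and the weights below are arbitrary illustrative values, not taken from the text) and evaluates y(x, w) = w^T φ(x) as in (3.3).

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x**j for j = 0, ..., M-1; phi_0(x) = 1 plays the role of the dummy bias basis function
    return np.array([x ** j for j in range(M)])

def gaussian_basis(x, mu, s):
    # Equation (3.4): unnormalized Gaussian bumps centred at mu_j with common scale s,
    # preceded by the dummy basis function phi_0(x) = 1
    return np.concatenate(([1.0], np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))))

def sigmoidal_basis(x, mu, s):
    # Equations (3.5)-(3.6): logistic sigmoids sigma((x - mu_j)/s), plus phi_0(x) = 1
    return np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-(x - mu) / s))))

def predict(x, w, basis, **kwargs):
    # Equation (3.3): y(x, w) = w^T phi(x)
    return w @ basis(x, **kwargs)

# Example usage (all numbers are purely illustrative)
mu = np.linspace(-1.0, 1.0, 9)   # basis-function centres
w = np.random.randn(10)          # M = 10 parameters, including the bias w_0
print(predict(0.3, w, gaussian_basis, mu=mu, s=0.2))
```

Evaluating the chosen basis at every training input and stacking the resulting vectors row-wise gives the design matrix commonly used in least-squares fitting.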
[Figure 3.1: Examples of basis functions, showing polynomials on the left, Gaussians of the form (3.4) in the centre, and sigmoidal functions of the form (3.5) on the right.]

Yet another possible choice of basis function is the Fourier basis, which leads to an expansion in sinusoidal functions. Each basis function represents a specific frequency and has infinite spatial extent. By contrast, basis functions that are localized to finite regions of input space necessarily comprise a spectrum of different spatial frequencies. In many signal processing applications, it is of interest to consider basis functions that are localized in both space and frequency, leading to a class of functions known as wavelets. These are also defined to be mutually orthogonal, to simplify their application. Wavelets are most applicable when the input values live on a regular lattice, such as the successive time points in a temporal sequence, or the pixels in an image. Useful texts on wavelets include Ogden (1997), Mallat (1999), and Vidakovic (1999).

Most of the discussion in this chapter, however, is independent of the particular choice of basis function set, and so for most of our discussion we shall not specify the particular form of the basis functions, except for the purposes of numerical illustration. Indeed, much of our discussion will be equally applicable to the situation in which the vector φ(x) of basis functions is simply the identity φ(x) = x. Furthermore, in order to keep the notation simple, we shall focus on the case of a single target variable t.
However, in Section 3.1.5, we consider briefly the modifications needed to deal with multiple target variables.

3.1.1 Maximum likelihood and least squares

In Chapter 1, we fitted polynomial functions to data sets by minimizing a sum-of-squares error function. We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model. Let us return to this discussion and consider the least squares approach, and its relation to maximum likelihood, in more detail.

As before, we assume that the target variable t is given by a deterministic function y(x, w) with additive Gaussian noise so that
\[ t = y(\mathbf{x}, \mathbf{w}) + \epsilon \tag{3.7} \]
where ε is a zero-mean Gaussian random variable with precision (inverse variance) β. Thus we can write
\[ p(t\,|\,\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\bigl(t \,\big|\, y(\mathbf{x}, \mathbf{w}), \beta^{-1}\bigr). \tag{3.8} \]
Recall from Section 1.5.5 that, if we assume a squared loss function, then the optimal prediction, for a new value of x, will be given by the conditional mean of the target variable. In the case of a Gaussian conditional distribution of the form (3.8), the conditional mean will be simply
\[ \mathbb{E}[t\,|\,\mathbf{x}] = \int t\, p(t\,|\,\mathbf{x})\, \mathrm{d}t = y(\mathbf{x}, \mathbf{w}). \tag{3.9} \]
Note that the Gaussian noise assumption implies that the conditional distribution of t given x is unimodal, which may be inappropriate for some applications. An extension to mixtures of conditional Gaussian distributions, which permit multimodal conditional distributions, will be discussed in Section 14.5.1.
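To make the noise model concrete, here is a small sketch, again assuming NumPy, that draws targets according to (3.7) with a fixed polynomial y(x, w) and precision β (both arbitrary, for illustration only) and checks empirically that their average approaches the conditional mean (3.9).

```python
import numpy as np

rng = np.random.default_rng(0)

def y(x, w):
    # Any fixed linear-in-the-parameters model; here a cubic polynomial in a single input x
    return sum(w_j * x ** j for j, w_j in enumerate(w))

w_true = np.array([0.5, -1.0, 2.0, 0.3])   # illustrative parameter values
beta = 25.0                                 # noise precision, so the noise variance is 1/beta

x = 0.7
t_samples = y(x, w_true) + rng.normal(0.0, 1.0 / np.sqrt(beta), size=100000)  # equation (3.7)

# Equation (3.9): the conditional mean E[t|x] equals y(x, w); the sample mean should be close to it
print(y(x, w_true), t_samples.mean())
```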
Now consider a data set of inputs X = {x_1, ..., x_N} with corresponding target values t_1, ..., t_N. We group the target variables {t_n} into a column vector that we denote by \(\mathsf{t}\), where the typeface is chosen to distinguish it from a single observation of a multivariate target, which would be denoted \(\mathbf{t}\). Making the assumption that these data points are drawn independently from the distribution (3.8), we obtain the following expression for the likelihood function, which is a function of the adjustable parameters w and β, in the form
\[ p(\mathsf{t}\,|\,\mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \,\big|\, \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\bigr) \tag{3.10} \]
where we have used (3.3). Note that in supervised learning problems such as regression (and classification), we are not seeking to model the distribution of the input variables.
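As a sketch of how (3.10) can be used, assuming NumPy and taking polynomial basis functions and synthetic data as placeholder choices (nothing here is prescribed by the text), the log of the likelihood can be evaluated directly. Because the Gaussian log likelihood depends on w only through a sum-of-squares term, maximizing it over w amounts to an ordinary least-squares fit, obtained below with an off-the-shelf routine.

```python
import numpy as np

def design_matrix(x, M):
    # Rows are phi(x_n)^T; polynomial basis functions phi_j(x) = x**j, with phi_0(x) = 1
    return np.vander(x, M, increasing=True)

def log_likelihood(t, Phi, w, beta):
    # Log of equation (3.10): sum over n of log N(t_n | w^T phi(x_n), beta^{-1})
    mean = Phi @ w
    return np.sum(-0.5 * np.log(2.0 * np.pi / beta) - 0.5 * beta * (t - mean) ** 2)

# Toy data generated in the spirit of the noise model (3.7)-(3.8); all numbers are illustrative
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=50)

Phi = design_matrix(x, M=6)
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares fit, i.e. the ML solution for w

print(log_likelihood(t, Phi, w_ls, beta=25.0))
print(log_likelihood(t, Phi, np.zeros(6), beta=25.0))   # lower than at the fitted w
```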