Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter α so that

\[
p(\mathbf{w}\mid\alpha) = \mathcal{N}(\mathbf{w}\mid\mathbf{0},\,\alpha^{-1}\mathbf{I})
\tag{3.52}
\]

and the corresponding posterior distribution over w is then given by (3.49) with

\[
\mathbf{m}_N = \beta\,\mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}
\tag{3.53}
\]
\[
\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}.
\tag{3.54}
\]

The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of w, takes the form

\[
\ln p(\mathbf{w}\mid\mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + \text{const}.
\tag{3.55}
\]

Maximization of this posterior distribution with respect to w is therefore equivalent to the minimization of the sum-of-squares error function with the addition of a quadratic regularization term, corresponding to (3.27) with λ = α/β.

We can illustrate Bayesian learning in a linear basis function model, as well as the sequential update of a posterior distribution, using a simple example involving straight-line fitting.
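Before turning to that example, here is a minimal NumPy sketch (not from the book; the function name and the random test data are illustrative) of evaluating (3.53)–(3.54) and of the equivalence between the posterior mean and the regularized least-squares solution with λ = α/β.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the Gaussian prior (3.52)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # (3.54)
    m_N = beta * S_N @ Phi.T @ t                                   # (3.53)
    return m_N, S_N

# Check against the regularized least-squares solution with lambda = alpha/beta.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))          # illustrative design matrix
t = rng.normal(size=50)                 # illustrative targets
alpha, beta = 2.0, 25.0
m_N, S_N = posterior(Phi, t, alpha, beta)
lam = alpha / beta
w_reg = np.linalg.solve(lam * np.eye(3) + Phi.T @ Phi, Phi.T @ t)
assert np.allclose(m_N, w_reg)          # MAP weights equal the regularized solution
```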
Consider a single input variable x, a single target variable t and a linear model of the form y(x, w) = w_0 + w_1 x. Because this has just two adaptive parameters, we can plot the prior and posterior distributions directly in parameter space. We generate synthetic data from the function f(x, a) = a_0 + a_1 x with parameter values a_0 = −0.3 and a_1 = 0.5 by first choosing values of x_n from the uniform distribution U(x | −1, 1), then evaluating f(x_n, a), and finally adding Gaussian noise with standard deviation 0.2 to obtain the target values t_n. Our goal is to recover the values of a_0 and a_1 from such data, and we will explore the dependence on the size of the data set.
We assume here that the noise variance is known and hence we set the precision parameter to its true value β = (1/0.2)² = 25. Similarly, we fix the parameter α to 2.0. We shall shortly discuss strategies for determining α and β from the training data. Figure 3.7 shows the results of Bayesian learning in this model as the size of the data set is increased and demonstrates the sequential nature of Bayesian learning in which the current posterior distribution forms the prior when a new data point is observed. It is worth taking time to study this figure in detail as it illustrates several important aspects of Bayesian inference.
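The sequential updating just described can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the Figure 3.7 setup (data generated from a_0 = −0.3, a_1 = 0.5 with noise standard deviation 0.2, and α = 2.0, β = 25), not code from the book; at each step the current posterior plays the role of the prior.

```python
import numpy as np

rng = np.random.default_rng(1)
a0, a1 = -0.3, 0.5                 # true parameters used to generate the data
alpha, beta = 2.0, 25.0            # beta = (1/0.2)**2

def phi(x):
    """Basis vector for the straight-line model y(x, w) = w0 + w1*x."""
    return np.array([1.0, x])

m, S = np.zeros(2), np.eye(2) / alpha   # prior (3.52): mean 0, covariance (1/alpha) I

for n in range(20):
    x_n = rng.uniform(-1.0, 1.0)
    t_n = a0 + a1 * x_n + rng.normal(scale=0.2)

    # Treat the current posterior as the prior and absorb the new point
    # (general Gaussian update; starting from (3.52) this reduces to (3.53)-(3.54)).
    S_inv = np.linalg.inv(S) + beta * np.outer(phi(x_n), phi(x_n))
    S_new = np.linalg.inv(S_inv)
    m = S_new @ (np.linalg.inv(S) @ m + beta * t_n * phi(x_n))
    S = S_new

print(m)   # the posterior mean approaches (a0, a1) as more points arrive
```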
The first row of this figure corresponds to the situation before any data points are observed and shows a plot of the prior distribution in w space together with six samples of the function y(x, w) in which the values of w are drawn from the prior. In the second row, we see the situation after observing a single data point. The location (x, t) of the data point is shown by a blue circle in the right-hand column. In the left-hand column is a plot of the likelihood function p(t|x, w) for this data point as a function of w.
Note that the likelihood function provides a soft constraint that the line must pass close to the data point, where close is determined by the noise precision β. For comparison, the true parameter values a_0 = −0.3 and a_1 = 0.5 used to generate the data set are shown by a white cross in the plots in the left column of Figure 3.7. When we multiply this likelihood function by the prior from the top row, and normalize, we obtain the posterior distribution shown in the middle plot on the second row. Samples of the regression function y(x, w) obtained by drawing samples of w from this posterior distribution are shown in the right-hand plot. Note that these sample lines all pass close to the data point. The third row of this figure shows the effect of observing a second data point, again shown by a blue circle in the plot in the right-hand column.
The corresponding likelihood function for this second data point alone is shown in the left plot. When we multiply this likelihood function by the posterior distribution from the second row, we obtain the posterior distribution shown in the middle plot of the third row. Note that this is exactly the same posterior distribution as would be obtained by combining the original prior with the likelihood function for the two data points. This posterior has now been influenced by two data points, and because two points are sufficient to define a line this already gives a relatively compact posterior distribution.
Samples from this posterior distribution give rise to the functions shown in red in the third column, and we see that these functions pass close to both of the data points. The fourth row shows the effect of observing a total of 20 data points. The left-hand plot shows the likelihood function for the 20th data point alone, and the middle plot shows the resulting posterior distribution that has now absorbed information from all 20 observations. Note how the posterior is much sharper than in the third row.
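The sampled regression functions in the right-hand column can be reproduced, under the same assumptions as the sketches above, by drawing weight vectors from the current posterior; the helper below is hypothetical, not from the book.

```python
import numpy as np

def sample_lines(m_N, S_N, xs, n_samples=6, rng=None):
    """Draw w ~ N(m_N, S_N) and evaluate y(x, w) = w0 + w1*x on the grid xs."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(m_N, S_N, size=n_samples)    # shape (n_samples, 2)
    return W[:, :1] + W[:, 1:] * xs[None, :]                  # one row per sampled line

xs = np.linspace(-1.0, 1.0, 100)
# lines = sample_lines(m, S, xs)   # using the posterior (m, S) from the loop above
```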
In the limit of an infinite number of data points, the posterior distribution would become a delta function centred on the true parameter values, shown by the white cross.

[Figure 3.7: Illustration of sequential Bayesian learning for a simple linear model of the form y(x, w) = w_0 + w_1 x. A detailed description of this figure is given in the text.]

Other forms of prior over the parameters can be considered. For instance, we can generalize the Gaussian prior to give

\[
p(\mathbf{w}\mid\alpha) = \left[\frac{q}{2}\left(\frac{\alpha}{2}\right)^{1/q}\frac{1}{\Gamma(1/q)}\right]^{M}\exp\left(-\frac{\alpha}{2}\sum_{j=1}^{M}|w_j|^{q}\right)
\tag{3.56}
\]

in which q = 2 corresponds to the Gaussian distribution, and only in this case is the prior conjugate to the likelihood function (3.10).
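As a small illustration (mine, not from the text), the log of (3.56) can be written down directly; maximizing the posterior under this prior adds the term (α/2) Σ_j |w_j|^q to the error function.

```python
import numpy as np
from scipy.special import gammaln

def log_generalized_prior(w, alpha, q):
    """Log of the generalized prior (3.56); q = 2 recovers the Gaussian (3.52)."""
    M = w.size
    log_norm = M * (np.log(q / 2.0) + np.log(alpha / 2.0) / q - gammaln(1.0 / q))
    return log_norm - (alpha / 2.0) * np.sum(np.abs(w) ** q)

w = np.array([0.5, -1.2, 0.0])
print(log_generalized_prior(w, alpha=2.0, q=2.0))   # Gaussian case (quadratic penalty)
print(log_generalized_prior(w, alpha=2.0, q=1.0))   # Laplace-like case (absolute-value penalty)
```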
Finding the maximum of the posterior distribution over w corresponds to minimization of the regularized error function (3.29). In the case of the Gaussian prior, the mode of the posterior distribution was equal to the mean, although this will no longer hold if q ≠ 2.

3.3.2 Predictive distribution

In practice, we are not usually interested in the value of w itself but rather in making predictions of t for new values of x. This requires that we evaluate the predictive distribution defined by

\[
p(t\mid\mathbf{t},\alpha,\beta) = \int p(t\mid\mathbf{w},\beta)\,p(\mathbf{w}\mid\mathbf{t},\alpha,\beta)\,\mathrm{d}\mathbf{w}
\tag{3.57}
\]

in which t is the vector of target values from the training set, and we have omitted the corresponding input vectors from the right-hand side of the conditioning statements to simplify the notation.
The conditional distribution p(t|x, w, β) of the target variable is given by (3.8), and the posterior weight distribution is given by (3.49). We see that (3.57) involves the convolution of two Gaussian distributions, and so making use of the result (2.115) from Section 8.1.4, we see that the predictive distribution takes the form (Exercise 3.10)

\[
p(t\mid x,\mathbf{t},\alpha,\beta) = \mathcal{N}\bigl(t\mid \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(x),\ \sigma_N^{2}(x)\bigr)
\tag{3.58}
\]

where the variance σ_N²(x) of the predictive distribution is given by

\[
\sigma_N^{2}(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(x).
\tag{3.59}
\]

The first term in (3.59) represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters w.
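A minimal sketch (illustrative, not from the book) of evaluating (3.58)–(3.59) at a new input, given m_N and S_N computed as above:

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance, equations (3.58) and (3.59)."""
    mean = m_N @ phi_x                              # m_N^T phi(x)
    var = 1.0 / beta + phi_x @ S_N @ phi_x          # noise term plus parameter uncertainty
    return mean, var

# e.g. for the straight-line model: mean, var = predictive(np.array([1.0, x]), m, S, beta)
```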
Because the noise process and the distribution of w are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown (Qazaz et al., 1997) that σ²_{N+1}(x) ≤ σ²_N(x) (Exercise 3.11). In the limit N → ∞, the second term in (3.59) goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter β.

As an illustration of the predictive distribution for Bayesian linear regression models, let us return to the synthetic sinusoidal data set of Section 1.1.
In Figure 3.8, we fit a model comprising a linear combination of Gaussian basis functions to data sets of various sizes and then look at the corresponding posterior distributions.

[Figure 3.8: Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions of the form (3.4) using the synthetic sinusoidal data set of Section 1.1. See the text for a detailed discussion.]
Here the green curves correspond to the function sin(2πx) from which the data points were generated (with the addition of Gaussian noise). Data sets of size N = 1, N = 2, N = 4, and N = 25 are shown in the four plots by the blue circles. For each plot, the red curve shows the mean of the corresponding Gaussian predictive distribution, and the red shaded region spans one standard deviation either side of the mean.
Note that the predictive uncertainty depends on x and is smallest in the neighbourhood of the data points. Also note that the level of uncertainty decreases as more data points are observed.

The plots in Figure 3.8 only show the point-wise predictive variance as a function of x. In order to gain insight into the covariance between the predictions at different values of x, we can draw samples from the posterior distribution over w, and then plot the corresponding functions y(x, w), as shown in Figure 3.9.

[Figure 3.9: Plots of the function y(x, w) using samples from the posterior distributions over w corresponding to the plots in Figure 3.8.]

If we used localized basis functions such as Gaussians, then in regions away from the basis function centres, the contribution from the second term in the predictive variance (3.59) will go to zero, leaving only the noise contribution β⁻¹. Thus, the model becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions, which is generally an undesirable behaviour. This problem can be avoided by adopting an alternative Bayesian approach to regression known as a Gaussian process (Section 6.4).

Note that, if both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β) that, from the discussion in Section 2.3.6, will be given by a Gaussian-gamma distribution (Denison et al., 2002); see Exercises 3.12 and 3.13.
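Returning to the localized-basis-function point above, the extrapolation behaviour can be seen numerically in the following sketch, which fits 9 Gaussian basis functions of the form (3.4) to synthetic sinusoidal data. The basis-function centres, width, noise level, and the values of α and β are illustrative choices of mine, not values given in the text.

```python
import numpy as np

def gaussian_basis(x, centers, s=0.1):
    """Vector of Gaussian basis functions (3.4) evaluated at a scalar x."""
    return np.exp(-0.5 * ((x - centers) / s) ** 2)

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
centers = np.linspace(0.0, 1.0, 9)                 # 9 localized basis functions

# synthetic sinusoidal data in the spirit of Section 1.1
N = 25
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.stack([gaussian_basis(xn, centers) for xn in x])
S_N = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)     # (3.54)
m_N = beta * S_N @ Phi.T @ t                                    # (3.53)

# Far from the basis-function centres phi(x) -> 0, so the second term of (3.59)
# vanishes and the predictive variance collapses to the noise level 1/beta,
# i.e. the model looks (over)confident when extrapolating.
phi_far = gaussian_basis(10.0, centers)            # x = 10 is well outside [0, 1]
var_far = 1.0 / beta + phi_far @ S_N @ phi_far
print(var_far, 1.0 / beta)                         # essentially identical
```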