Bishop C.M., Pattern Recognition and Machine Learning (2006)
In this case, the predictive distribution is a Student's t-distribution.

[Figure 3.10: The equivalent kernel k(x, x′) for the Gaussian basis functions in Figure 3.1, shown as a plot of x versus x′, together with three slices through this matrix corresponding to three different values of x. The data set used to generate this kernel comprised 200 values of x equally spaced over the interval (−1, 1).]

3.3.3 Equivalent kernel

The posterior mean solution (3.53) for the linear basis function model has an interesting interpretation that will set the stage for kernel methods, including Gaussian processes (Chapter 6).
If we substitute (3.53) into the expression (3.3), we see that the predictive mean can be written in the form

\[
y(x, m_N) = m_N^T \phi(x) = \beta\, \phi(x)^T S_N \Phi^T t = \sum_{n=1}^{N} \beta\, \phi(x)^T S_N \phi(x_n)\, t_n
\tag{3.60}
\]

where S_N is defined by (3.51). Thus the mean of the predictive distribution at a point x is given by a linear combination of the training set target variables t_n, so that we can write

\[
y(x, m_N) = \sum_{n=1}^{N} k(x, x_n)\, t_n
\tag{3.61}
\]

where the function

\[
k(x, x') = \beta\, \phi(x)^T S_N \phi(x')
\tag{3.62}
\]

is known as the smoother matrix or the equivalent kernel.
Regression functions, such as this, which make predictions by taking linear combinations of the training set target values, are known as linear smoothers. Note that the equivalent kernel depends on the input values x_n from the data set because these appear in the definition of S_N. The equivalent kernel is illustrated for the case of Gaussian basis functions in Figure 3.10, in which the kernel functions k(x, x′) have been plotted as a function of x′ for three different values of x. We see that they are localized around x, and so the mean of the predictive distribution at x, given by y(x, m_N), is obtained by forming a weighted combination of the target values in which data points close to x are given higher weight than points further removed from x.
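As a rough numerical illustration of (3.60)–(3.62), the following sketch builds the posterior quantities S_N and m_N for a set of Gaussian basis functions and evaluates the equivalent kernel. The data set, the basis-function centres and widths, and the values of α and β below are arbitrary choices made for the example; they are not the settings used to produce Figure 3.10.

```python
import numpy as np

# Minimal sketch of the equivalent kernel (3.62) for Gaussian basis functions.
# The data, basis centres, and the values of alpha and beta are illustrative only.
rng = np.random.default_rng(0)

def gaussian_basis(x, centres, s=0.1):
    """Design matrix of Gaussian basis functions, with a constant bias column."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    phi = np.exp(-0.5 * ((x - centres) / s) ** 2)
    return np.hstack([np.ones_like(x), phi])

alpha, beta = 2.0, 25.0                   # prior precision, noise precision (assumed)
x_train = np.linspace(-1, 1, 200)         # 200 equally spaced inputs, cf. Figure 3.10
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, x_train.size)
centres = np.linspace(-1, 1, 9)

Phi = gaussian_basis(x_train, centres)                                     # N x M design matrix
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)     # (3.51)
m_N = beta * S_N @ Phi.T @ t_train                                         # (3.53)

def equivalent_kernel(x, x_prime):
    """k(x, x') = beta * phi(x)^T S_N phi(x'), equation (3.62)."""
    return beta * gaussian_basis(x, centres) @ S_N @ gaussian_basis(x_prime, centres).T

# The predictive mean (3.60) equals the kernel-weighted sum of targets (3.61).
x_star = np.array([0.3])
k = equivalent_kernel(x_star, x_train)                    # weights k(x*, x_n)
print(np.allclose(k @ t_train, gaussian_basis(x_star, centres) @ m_N))  # True
print(k.sum())   # close to 1: the weights approximately sum to one, cf. (3.64) below
```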
Intuitively, it seems reasonable that we should weight local evidence more strongly than distant evidence. Note that this localization property holds not only for the localized Gaussian basis functions but also for the nonlocal polynomial and sigmoidal basis functions, as illustrated in Figure 3.11.

[Figure 3.11: Examples of equivalent kernels k(x, x′) for x = 0 plotted as a function of x′, corresponding (left) to the polynomial basis functions and (right) to the sigmoidal basis functions shown in Figure 3.1. Note that these are localized functions of x′ even though the corresponding basis functions are nonlocal.]

Further insight into the role of the equivalent kernel can be obtained by considering the covariance between y(x) and y(x′), which is given by

\[
\operatorname{cov}[y(x), y(x')] = \operatorname{cov}[\phi(x)^T w,\; w^T \phi(x')] = \phi(x)^T S_N \phi(x') = \beta^{-1} k(x, x')
\tag{3.63}
\]

where we have made use of (3.49) and (3.62).
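Equation (3.63) can be checked by Monte Carlo: draw samples of w from the posterior N(m_N, S_N) and compare the sample covariance of y(x) = φ(x)^T w with β⁻¹ k(x, x′). The lines below simply continue the earlier sketch, under the same assumed data and settings.

```python
# Continues the previous sketch: check (3.63) by sampling w from the posterior
# N(m_N, S_N) and comparing the sample covariance of y(x) with beta^{-1} k(x, x').
x_grid = np.array([-0.5, 0.0, 0.02, 0.5])
Phi_grid = gaussian_basis(x_grid, centres)

w_samples = rng.multivariate_normal(m_N, S_N, size=20000)   # draws from p(w | t)
y_samples = w_samples @ Phi_grid.T                           # y(x, w) at the grid points

emp_cov = np.cov(y_samples, rowvar=False)                    # empirical covariance
theory = equivalent_kernel(x_grid, x_grid) / beta            # beta^{-1} k(x, x'), (3.63)
print(np.abs(emp_cov - theory).max())                        # small, shrinking with more samples
```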
From the form of the equivalent kernel, we see that the predictive mean at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller. The predictive distribution shown in Figure 3.8 allows us to visualize the pointwise uncertainty in the predictions, governed by (3.59). However, by drawing samples from the posterior distribution over w, and plotting the corresponding model functions y(x, w) as in Figure 3.9, we are visualizing the joint uncertainty in the posterior distribution between the y values at two (or more) x values, as governed by the equivalent kernel.

The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression as follows. Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can instead define a localized kernel directly and use this to make predictions for new input vectors x, given the observed training set.
This leads to a practical framework for regression (and classification) called Gaussian processes, which will be discussed in detail in Section 6.4.

We have seen that the effective kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of x, and it can be shown (Exercise 3.14) that these weights sum to one, in other words

\[
\sum_{n=1}^{N} k(x, x_n) = 1
\tag{3.64}
\]

for all values of x. This intuitively pleasing result can easily be proven informally by noting that the summation is equivalent to considering the predictive mean y(x) for a set of target data in which t_n = 1 for all n. Provided the basis functions are linearly independent, that there are more data points than basis functions, and that one of the basis functions is constant (corresponding to the bias parameter), then it is clear that we can fit the training data exactly and hence that the predictive mean will be simply y(x) = 1, from which we obtain (3.64). Note that the kernel function can be negative as well as positive, so although it satisfies a summation constraint, the corresponding predictions are not necessarily convex combinations of the training set target variables.

Finally, we note that the equivalent kernel (3.62) satisfies an important property shared by kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions, so that

\[
k(x, z) = \psi(x)^T \psi(z)
\tag{3.65}
\]

where ψ(x) = β^{1/2} S_N^{1/2} φ(x).
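The inner-product form (3.65) can be verified numerically with a matrix square root of S_N. The short continuation below reuses the quantities from the earlier sketch and is, again, illustrative only.

```python
# Continues the earlier sketch: express k(x, z) as an inner product (3.65) using
# psi(x) = beta^{1/2} S_N^{1/2} phi(x), with S_N^{1/2} from an eigendecomposition
# of the symmetric positive-definite matrix S_N.
eigvals, eigvecs = np.linalg.eigh(S_N)
S_N_sqrt = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

def psi(x):
    return np.sqrt(beta) * gaussian_basis(x, centres) @ S_N_sqrt

x_a, x_b = np.array([0.2]), np.array([-0.7])
print(np.allclose(psi(x_a) @ psi(x_b).T, equivalent_kernel(x_a, x_b)))   # True
```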
3.4. Bayesian Model Comparison

In Chapter 1 (Section 1.5.4), we highlighted the problem of over-fitting as well as the use of cross-validation as a technique for setting the values of regularization parameters or for choosing between alternative models. Here we consider the problem of model selection from a Bayesian perspective.
In this section, our discussion will be very general, and then in Section 3.5 we shall see how these ideas can be applied to the determination of regularization parameters in linear regression.

As we shall see, the over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data, without the need for a validation set. This allows all available data to be used for training and avoids the multiple training runs for each model associated with cross-validation.
It also allows multiple complexity parameters to be determined simultaneously as part of the training process. For example, in Chapter 7 we shall introduce the relevance vector machine, which is a Bayesian model having one complexity parameter for every training data point.

The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model, along with a consistent application of the sum and product rules of probability. Suppose we wish to compare a set of L models {M_i} where i = 1, . . . , L.
Here a model refers to a probability distribution over the observed data D. In the case of the polynomial curve-fitting problem, the distribution is defined over the set of target values t, while the set of input values X is assumed to be known. Other types of model define a joint distribution over X and t. We shall suppose that the data is generated from one of these models but we are uncertain which one. Our uncertainty is expressed through a prior probability distribution p(M_i).
Given a training set D, we then wish to evaluate the posterior distribution

\[
p(M_i \mid D) \propto p(M_i)\, p(D \mid M_i).
\tag{3.66}
\]

The prior allows us to express a preference for different models. Let us simply assume that all models are given equal prior probability. The interesting term is the model evidence p(D|M_i), which expresses the preference shown by the data for different models, and we shall examine this term in more detail shortly. The model evidence is sometimes also called the marginal likelihood because it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out.
The ratio of model evidences p(D|Mi )/p(D|Mj ) for two modelsis known as a Bayes factor (Kass and Raftery, 1995).Once we know the posterior distribution over models, the predictive distributionis given, from the sum and product rules, byp(t|x, D) =Lp(t|x, Mi , D)p(Mi |D).(3.67)i=1This is an example of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions p(t|x, Mi , D) of individualmodels, weighted by the posterior probabilities p(Mi |D) of those models. For instance, if we have two models that are a-posteriori equally likely and one predictsa narrow distribution around t = a while the other predicts a narrow distributionaround t = b, the overall predictive distribution will be a bimodal distribution withmodes at t = a and t = b, not a single model at t = (a + b)/2.A simple approximation to model averaging is to use the single most probablemodel alone to make predictions.
A simple approximation to model averaging is to use the single most probable model alone to make predictions. This is known as model selection.

For a model governed by a set of parameters w, the model evidence is given, from the sum and product rules of probability, by

\[
p(D \mid M_i) = \int p(D \mid w, M_i)\, p(w \mid M_i)\, \mathrm{d}w.
\tag{3.68}
\]

From a sampling perspective (Chapter 11), the marginal likelihood can be viewed as the probability of generating the data set D from a model whose parameters are sampled at random from the prior.
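This sampling view suggests a simple (if usually inefficient) Monte Carlo estimate of (3.68): draw parameter values from the prior and average the resulting likelihoods. The sketch below does this for a made-up one-parameter model (unknown Gaussian mean, known unit noise); the data, prior, and noise level are arbitrary assumptions for illustration.

```python
import numpy as np

# Rough Monte Carlo estimate of the model evidence (3.68): average p(D | w)
# over parameter values w drawn from the prior p(w). Model and data are invented.
rng = np.random.default_rng(1)
D = rng.normal(0.5, 1.0, size=20)                       # toy data set

def log_likelihood(w, data, noise_var=1.0):
    # log p(D | w) for a Gaussian model with unknown mean w and known noise variance
    return -0.5 * np.sum((data[None, :] - w[:, None]) ** 2 / noise_var
                         + np.log(2 * np.pi * noise_var), axis=1)

w_prior = rng.normal(0.0, 2.0, size=100_000)            # samples from the prior p(w) = N(0, 2^2)
log_like = log_likelihood(w_prior, D)

# Average of p(D | w) over the prior, computed stably in log space.
log_evidence = np.logaddexp.reduce(log_like) - np.log(w_prior.size)
print(log_evidence)
```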
It is also interesting to note that the evidence is precisely the normalizing term that appears in the denominator in Bayes' theorem when evaluating the posterior distribution over parameters because

\[
p(w \mid D, M_i) = \frac{p(D \mid w, M_i)\, p(w \mid M_i)}{p(D \mid M_i)}.
\tag{3.69}
\]

We can obtain some insight into the model evidence by making a simple approximation to the integral over parameters. Consider first the case of a model having a single parameter w. The posterior distribution over parameters is proportional to p(D|w)p(w), where we omit the dependence on the model M_i to keep the notation uncluttered.
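For a single parameter the integral in (3.68) can also be approximated directly on a grid, which at the same time yields the normalizing constant needed in (3.69). The snippet below continues the toy one-parameter model from the previous sketch, under the same invented assumptions.

```python
# Grid-based view of (3.68) and (3.69) for the toy single-parameter model above:
# the evidence is the area under prior * likelihood, and dividing by it gives a
# normalized posterior over w.
w_grid = np.linspace(-8.0, 8.0, 4001)
dw = w_grid[1] - w_grid[0]

prior_pdf = np.exp(-0.5 * (w_grid / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))  # p(w) = N(0, 2^2)
likelihood = np.exp(log_likelihood(w_grid, D))                               # p(D | w)

evidence = np.sum(likelihood * prior_pdf) * dw           # grid approximation to (3.68)
posterior_pdf = likelihood * prior_pdf / evidence        # normalized as in (3.69)

print(np.log(evidence))            # close to the Monte Carlo estimate above
print(np.sum(posterior_pdf) * dw)  # integrates to (approximately) 1
```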