Although the resulting integral over w is no longer analytically tractable, it might be thought that approximating this integral, for example using the Laplace approximation discussed in Section 4.4, which is based on a local Gaussian approximation centred on the mode of the posterior distribution, might provide a practical alternative to the evidence framework (Buntine and Weigend, 1991). However, the integrand as a function of w typically has a strongly skewed mode, so that the Laplace approximation fails to capture the bulk of the probability mass, leading to poorer results than those obtained by maximizing the evidence (MacKay, 1999).

Returning to the evidence framework, we note that there are two approaches that we can take to the maximization of the log evidence.
We can evaluate the evidence function analytically and then set its derivative equal to zero to obtain re-estimation equations for α and β, which we shall do in Section 3.5.2. Alternatively, we can use a technique called the expectation maximization (EM) algorithm, which will be discussed in Section 9.3.4, where we shall also show that these two approaches converge to the same solution.

3.5.1 Evaluation of the evidence function

The marginal likelihood function p(t|α, β) is obtained by integrating over the weight parameters w, so that

$$p(\mathbf{t}\mid\alpha,\beta) = \int p(\mathbf{t}\mid\mathbf{w},\beta)\,p(\mathbf{w}\mid\alpha)\,\mathrm{d}\mathbf{w}. \tag{3.77}$$

One way to evaluate this integral is to make use once again of the result (2.115) for the conditional distribution in a linear-Gaussian model (Exercises 3.16 and 3.17).
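As a concrete illustration of this first route, the following sketch (a hypothetical NumPy/SciPy example, not code from the book) evaluates the log evidence directly from the Gaussian marginal t ~ N(0, β⁻¹I + α⁻¹ΦΦᵀ) implied by the linear-Gaussian result; the design matrix Phi and target vector t are assumed to be given.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence_direct(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) evaluated directly from
    the Gaussian marginal t ~ N(0, beta^{-1} I + alpha^{-1} Phi Phi^T)."""
    N = Phi.shape[0]
    cov = np.eye(N) / beta + Phi @ Phi.T / alpha   # marginal covariance of t
    return multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(t)
```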
Here we shall evaluate the integral instead by completing the square in the exponent and making use of the standard form for the normalization coefficient of a Gaussian.

From (3.11), (3.12), and (3.52), we can write the evidence function in the form

$$p(\mathbf{t}\mid\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \int \exp\{-E(\mathbf{w})\}\,\mathrm{d}\mathbf{w} \tag{3.78}$$

where M is the dimensionality of w, and we have defined

$$E(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}) = \frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2 + \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}. \tag{3.79}$$

We recognize (3.79) as being equal, up to a constant of proportionality, to the regularized sum-of-squares error function (3.27). We now complete the square over w (Exercise 3.18), giving

$$E(\mathbf{w}) = E(\mathbf{m}_N) + \frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^{\mathrm{T}}\mathbf{A}(\mathbf{w} - \mathbf{m}_N) \tag{3.80}$$

where we have introduced

$$\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \tag{3.81}$$

together with

$$E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N. \tag{3.82}$$

Note that A corresponds to the matrix of second derivatives of the error function

$$\mathbf{A} = \nabla\nabla E(\mathbf{w}) \tag{3.83}$$

and is known as the Hessian matrix.
Here we have also defined mN, given by

$$\mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}. \tag{3.84}$$

Using (3.54), we see that A = S_N^{-1}, and hence (3.84) is equivalent to the previous definition (3.53), and therefore represents the mean of the posterior distribution (Exercise 3.19).

The integral over w can now be evaluated simply by appealing to the standard result for the normalization coefficient of a multivariate Gaussian, giving

$$\begin{aligned}
\int \exp\{-E(\mathbf{w})\}\,\mathrm{d}\mathbf{w}
&= \exp\{-E(\mathbf{m}_N)\} \int \exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^{\mathrm{T}}\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\right\}\mathrm{d}\mathbf{w} \\
&= \exp\{-E(\mathbf{m}_N)\}\,(2\pi)^{M/2}\,|\mathbf{A}|^{-1/2}.
\end{aligned} \tag{3.85}$$

Using (3.78), we can then write the log of the marginal likelihood in the form

$$\ln p(\mathbf{t}\mid\alpha,\beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi) \tag{3.86}$$

which is the required expression for the evidence function.
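To make the evidence function concrete, here is a minimal NumPy sketch (an illustrative implementation, not code from the book) that evaluates (3.86) for given α, β, design matrix Phi, and targets t; with the same inputs it should agree with the direct marginal evaluation sketched earlier.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta), equation (3.86)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi          # (3.81), the Hessian
    m_N = beta * np.linalg.solve(A, Phi.T @ t)          # (3.84), posterior mean
    E_mN = (beta / 2) * np.sum((t - Phi @ m_N) ** 2) \
           + (alpha / 2) * m_N @ m_N                    # (3.82)
    _, logdet_A = np.linalg.slogdet(A)                  # ln |A|, computed stably
    return (M / 2) * np.log(alpha) + (N / 2) * np.log(beta) \
           - E_mN - logdet_A / 2 - (N / 2) * np.log(2 * np.pi)
```

A model-comparison loop would simply evaluate this function for each candidate basis (for instance, polynomial orders M = 0, …, 9) and compare the results, as in Figure 3.14.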
Returning to the polynomial regression problem, we can plot the model evidence against the order of the polynomial, as shown in Figure 3.14. Here we have assumed a prior of the form (1.65) with the parameter α fixed at α = 5 × 10⁻³.

Figure 3.14  Plot of the model evidence versus the order M, for the polynomial regression model, showing that the evidence favours the model with M = 3.

The form of this plot is very instructive. Referring back to Figure 1.4, we see that the M = 0 polynomial has a very poor fit to the data and consequently gives a relatively low value for the evidence. Going to the M = 1 polynomial greatly improves the data fit, and hence the evidence is significantly higher.
However, in going to M = 2, the data fit is improved only very marginally, due to the fact that the underlying sinusoidal function from which the data is generated is an odd function and so has no even terms in a polynomial expansion. Indeed, Figure 1.5 shows that the residual data error is reduced only slightly in going from M = 1 to M = 2. Because this richer model suffers a greater complexity penalty, the evidence actually falls in going from M = 1 to M = 2. When we go to M = 3 we obtain a significant further improvement in data fit, as seen in Figure 1.4, and so the evidence is increased again, giving the highest overall evidence for any of the polynomials.
Further increases in the value of M produce only small improvements in the fit to the data but suffer an increasing complexity penalty, leading overall to a decrease in the evidence values. Looking again at Figure 1.5, we see that the generalization error is roughly constant between M = 3 and M = 8, and it would be difficult to choose between these models on the basis of this plot alone. The evidence values, however, show a clear preference for M = 3, since this is the simplest model that gives a good explanation for the observed data.

3.5.2 Maximizing the evidence function

Let us first consider the maximization of p(t|α, β) with respect to α. This can be done by first defining the following eigenvector equation

$$\left(\beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\right)\mathbf{u}_i = \lambda_i\mathbf{u}_i. \tag{3.87}$$

From (3.81), it then follows that A has eigenvalues α + λi.
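As a quick numerical sanity check of this claim (an illustrative sketch with an assumed random design matrix and arbitrary α, β; not from the text), one can verify that the eigenvalues of A are the λi shifted by α:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))       # hypothetical 50 x 4 design matrix
alpha, beta = 0.5, 2.0

lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)            # lambda_i from (3.87)
A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi   # Hessian, (3.81)
assert np.allclose(np.linalg.eigvalsh(A), lam + alpha)  # eigenvalues are alpha + lambda_i
```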
Now consider the derivative of the term involving ln |A| in (3.86) with respect to α. We have

$$\frac{d}{d\alpha}\ln|\mathbf{A}| = \frac{d}{d\alpha}\ln\prod_i(\lambda_i + \alpha) = \frac{d}{d\alpha}\sum_i\ln(\lambda_i + \alpha) = \sum_i\frac{1}{\lambda_i + \alpha}. \tag{3.88}$$

Thus the stationary points of (3.86) with respect to α satisfy

$$0 = \frac{M}{2\alpha} - \frac{1}{2}\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N - \frac{1}{2}\sum_i\frac{1}{\lambda_i + \alpha}. \tag{3.89}$$

Multiplying through by 2α and rearranging, we obtain

$$\alpha\,\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N = M - \alpha\sum_i\frac{1}{\lambda_i + \alpha} = \gamma. \tag{3.90}$$

Since there are M terms in the sum over i, the quantity γ can be written

$$\gamma = \sum_i\frac{\lambda_i}{\alpha + \lambda_i}. \tag{3.91}$$

The interpretation of the quantity γ will be discussed shortly (Exercise 3.20). From (3.90) we see that the value of α that maximizes the marginal likelihood satisfies

$$\alpha = \frac{\gamma}{\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N}. \tag{3.92}$$

Note that this is an implicit solution for α not only because γ depends on α, but also because the mode mN of the posterior distribution itself depends on the choice of α.
We therefore adopt an iterative procedure in which we make an initial choice for α and use this to find mN, which is given by (3.53), and also to evaluate γ, which is given by (3.91). These values are then used to re-estimate α using (3.92), and the process is repeated until convergence.
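A minimal sketch of this fixed-point iteration for α (an illustrative NumPy example with β held fixed and Phi, t assumed given; the function name and tolerance are hypothetical, not from the book) might look as follows:

```python
import numpy as np

def reestimate_alpha(Phi, t, beta, alpha0=1.0, n_iter=100, tol=1e-8):
    """Iterate (3.53), (3.91), and (3.92) to update alpha with beta held fixed."""
    N, M = Phi.shape
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)       # eigenvalues in (3.87)
    alpha = alpha0
    for _ in range(n_iter):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi     # (3.81)
        m_N = beta * np.linalg.solve(A, Phi.T @ t)     # posterior mean, (3.53)
        gamma = np.sum(lam / (alpha + lam))            # (3.91)
        alpha_new = gamma / (m_N @ m_N)                # re-estimation equation (3.92)
        if abs(alpha_new - alpha) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha
```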
Note that because the matrix ΦᵀΦ is fixed, we can compute its eigenvalues once at the start and then simply multiply these by β to obtain the λi.

It should be emphasized that the value of α has been determined purely by looking at the training data. In contrast to maximum likelihood methods, no independent data set is required in order to optimize the model complexity.

We can similarly maximize the log marginal likelihood (3.86) with respect to β. To do this, we note that the eigenvalues λi defined by (3.87) are proportional to β, and hence dλi/dβ = λi/β, giving

$$\frac{d}{d\beta}\ln|\mathbf{A}| = \frac{d}{d\beta}\sum_i\ln(\lambda_i + \alpha) = \frac{1}{\beta}\sum_i\frac{\lambda_i}{\lambda_i + \alpha} = \frac{\gamma}{\beta}. \tag{3.93}$$

The stationary point of the marginal likelihood therefore satisfies

$$0 = \frac{N}{2\beta} - \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 - \frac{\gamma}{2\beta} \tag{3.94}$$

and rearranging we obtain (Exercise 3.22)

$$\frac{1}{\beta} = \frac{1}{N - \gamma}\sum_{n=1}^{N}\left\{t_n - \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2. \tag{3.95}$$

Again, this is an implicit solution for β and can be solved by choosing an initial value for β, using this to calculate mN and γ, and then re-estimating β using (3.95), repeating until convergence. If both α and β are to be determined from the data, then their values can be re-estimated together after each update of γ.
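Putting the two updates together, a joint re-estimation loop might be sketched as follows (an illustrative NumPy example rather than the book's code; the function name, initial values, and tolerance are assumptions). The eigenvalues of ΦᵀΦ are computed once and simply rescaled by β on each iteration, as noted above:

```python
import numpy as np

def evidence_approximation(Phi, t, alpha=1.0, beta=1.0, n_iter=100, tol=1e-8):
    """Jointly re-estimate alpha and beta using (3.92) and (3.95)."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)   # eigenvalues of Phi^T Phi, computed once
    for _ in range(n_iter):
        lam = beta * eig0                                      # lambda_i in (3.87)
        A = alpha * np.eye(M) + beta * Phi.T @ Phi             # (3.81)
        m_N = beta * np.linalg.solve(A, Phi.T @ t)             # posterior mean, (3.53)
        gamma = np.sum(lam / (alpha + lam))                    # (3.91)
        alpha_new = gamma / (m_N @ m_N)                        # (3.92)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)  # inverse of (3.95)
        if abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol:
            break
        alpha, beta = alpha_new, beta_new
    return alpha, beta, m_N
```

The resulting α and β can then be substituted back into (3.86) to evaluate the evidence, or used to form the predictive distribution of Section 3.3.2.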
Figure 3.15  Contours of the likelihood function (red) and the prior (green) in which the axes in parameter space have been rotated to align with the eigenvectors ui of the Hessian. For α = 0, the mode of the posterior is given by the maximum likelihood solution wML, whereas for nonzero α the mode is at wMAP = mN. In the direction w1 the eigenvalue λ1, defined by (3.87), is small compared with α and so the quantity λ1/(λ1 + α) is close to zero, and the corresponding MAP value of w1 is also close to zero. By contrast, in the direction w2 the eigenvalue λ2 is large compared with α and so the quantity λ2/(λ2 + α) is close to unity, and the MAP value of w2 is close to its maximum likelihood value.

3.5.3 Effective number of parameters

The result (3.92) has an elegant interpretation (MacKay, 1992a), which provides insight into the Bayesian solution for α.