Bishop, C.M., Pattern Recognition and Machine Learning (2006), page 66 of file
From (5.165), this is given by

$$
\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w}\mid\mathcal{D},\alpha,\beta) = \alpha\mathbf{I} + \beta\mathbf{H} \tag{5.166}
$$

where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of $\mathbf{w}$. Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by

$$
q(\mathbf{w}\mid\mathcal{D}) = \mathcal{N}(\mathbf{w}\mid\mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \tag{5.167}
$$

Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution

$$
p(t\mid\mathbf{x},\mathcal{D}) = \int p(t\mid\mathbf{x},\mathbf{w})\, q(\mathbf{w}\mid\mathcal{D})\, \mathrm{d}\mathbf{w}. \tag{5.168}
$$

However, even with the Gaussian approximation to the posterior, this integration is still analytically intractable due to the nonlinearity of the network function $y(\mathbf{x},\mathbf{w})$ as a function of $\mathbf{w}$. To make progress, we now assume that the posterior distribution has small variance compared with the characteristic scales of $\mathbf{w}$ over which $y(\mathbf{x},\mathbf{w})$ is varying.
This allows us to make a Taylor series expansion of the network function around $\mathbf{w}_{\mathrm{MAP}}$ and retain only the linear terms

$$
y(\mathbf{x},\mathbf{w}) \simeq y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}) \tag{5.169}
$$

where we have defined

$$
\mathbf{g} = \left.\nabla_{\mathbf{w}}\, y(\mathbf{x},\mathbf{w})\right|_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}. \tag{5.170}
$$

With this approximation, we now have a linear-Gaussian model with a Gaussian distribution for $p(\mathbf{w})$ and a Gaussian for $p(t\mid\mathbf{w})$ whose mean is a linear function of $\mathbf{w}$ of the form

$$
p(t\mid\mathbf{x},\mathbf{w},\beta) \simeq \mathcal{N}\bigl(t\mid y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}),\, \beta^{-1}\bigr). \tag{5.171}
$$

We can therefore make use of the general result (2.115) for the marginal $p(t)$ (Exercise 5.38) to give

$$
p(t\mid\mathbf{x},\mathcal{D},\alpha,\beta) = \mathcal{N}\bigl(t\mid y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}),\, \sigma^2(\mathbf{x})\bigr) \tag{5.172}
$$

where the input-dependent variance is given by

$$
\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{g}. \tag{5.173}
$$

We see that the predictive distribution $p(t\mid\mathbf{x},\mathcal{D})$ is a Gaussian whose mean is given by the network function $y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$ with the parameters set to their MAP value.
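The predictive variance (5.173) is straightforward to compute once the Hessian is available. The following is a minimal sketch in Python/NumPy, using random stand-ins for the Hessian $\mathbf{H}$ and the gradient $\mathbf{g}$ (in practice $\mathbf{H}$ would come from the methods of Section 5.4 and $\mathbf{g}$ from backpropagation); all sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (stand-ins, not from the text).
W = 5                 # number of network weights
alpha, beta = 2.0, 10.0

# Stand-in for the Hessian H of the sum-of-squares error at w_MAP.
M = rng.normal(size=(W, W))
H = M @ M.T           # symmetric positive semi-definite

# Stand-in for g = grad_w y(x, w) at w_MAP, eq. (5.170).
g = rng.normal(size=W)

# A = alpha*I + beta*H, eq. (5.166).
A = alpha * np.eye(W) + beta * H

# Input-dependent predictive variance, eq. (5.173):
# sigma^2(x) = 1/beta + g^T A^{-1} g.
sigma2 = 1.0 / beta + g @ np.linalg.solve(A, g)
print(sigma2)
```

Note that because $\mathbf{A}$ is positive definite, the second term is non-negative, so the predictive variance can never fall below the intrinsic noise level $\beta^{-1}$.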
The variance has two terms, the first of which arises from the intrinsic noise on the target variable, whereas the second is an $\mathbf{x}$-dependent term that expresses the uncertainty in the interpolant due to the uncertainty in the model parameters $\mathbf{w}$. This should be compared with the corresponding predictive distribution for the linear regression model, given by (3.58) and (3.59).

5.7.2 Hyperparameter optimization

So far, we have assumed that the hyperparameters $\alpha$ and $\beta$ are fixed and known. We can make use of the evidence framework, discussed in Section 3.5, together with the Gaussian approximation to the posterior obtained using the Laplace approximation, to obtain a practical procedure for choosing the values of such hyperparameters.

The marginal likelihood, or evidence, for the hyperparameters is obtained by integrating over the network weights

$$
p(\mathcal{D}\mid\alpha,\beta) = \int p(\mathcal{D}\mid\mathbf{w},\beta)\, p(\mathbf{w}\mid\alpha)\, \mathrm{d}\mathbf{w}. \tag{5.174}
$$

This is easily evaluated by making use of the Laplace approximation result (4.135) (Exercise 5.39). Taking logarithms then gives

$$
\ln p(\mathcal{D}\mid\alpha,\beta) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2}\ln|\mathbf{A}| + \frac{W}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) \tag{5.175}
$$

where $W$ is the total number of parameters in $\mathbf{w}$, and the regularized error function is defined by

$$
E(\mathbf{w}_{\mathrm{MAP}}) = \frac{\beta}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}}) - t_n\}^2 + \frac{\alpha}{2}\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}. \tag{5.176}
$$

We see that this takes the same form as the corresponding result (3.86) for the linear regression model.

In the evidence framework, we make point estimates for $\alpha$ and $\beta$ by maximizing $\ln p(\mathcal{D}\mid\alpha,\beta)$.
Consider first the maximization with respect to $\alpha$, which can be done by analogy with the linear regression case discussed in Section 3.5.2. We first define the eigenvalue equation

$$
\beta\mathbf{H}\mathbf{u}_i = \lambda_i\mathbf{u}_i \tag{5.177}
$$

where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function, evaluated at $\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}$. By analogy with (3.92), we obtain

$$
\alpha = \frac{\gamma}{\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}} \tag{5.178}
$$

where $\gamma$ represents the effective number of parameters (Section 3.5.3) and is defined by

$$
\gamma = \sum_{i=1}^{W}\frac{\lambda_i}{\alpha + \lambda_i}. \tag{5.179}
$$

Note that this result was exact for the linear regression case.
For the nonlinear neural network, however, it ignores the fact that changes in $\alpha$ will cause changes in the Hessian $\mathbf{H}$, which in turn will change the eigenvalues. We have therefore implicitly ignored terms involving the derivatives of $\lambda_i$ with respect to $\alpha$.

Similarly, from (3.95) we see that maximizing the evidence with respect to $\beta$ gives the re-estimation formula

$$
\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\{y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}}) - t_n\}^2. \tag{5.180}
$$

As with the linear model, we need to alternate between re-estimation of the hyperparameters $\alpha$ and $\beta$ and updating of the posterior distribution.
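The re-estimation equations (5.177)–(5.180) can be iterated to a fixed point. Below is a minimal sketch, assuming NumPy and random stand-ins for $\mathbf{w}_{\mathrm{MAP}}$, the residuals, and the Hessian; in the full procedure one would re-find $\mathbf{w}_{\mathrm{MAP}}$ (and hence $\mathbf{H}$ and the residuals) between hyperparameter updates, which is omitted here purely to keep the illustration short.

```python
import numpy as np

rng = np.random.default_rng(1)

W, N = 5, 50
w_map = rng.normal(size=W)                 # stand-in MAP weights
resid = rng.normal(scale=0.3, size=N)      # stand-in y(x_n, w_MAP) - t_n
M = rng.normal(size=(W, W))
H = M @ M.T                                # stand-in error-function Hessian

alpha, beta = 1.0, 1.0
for _ in range(100):
    lam = beta * np.linalg.eigvalsh(H)     # eigenvalues of beta*H, eq. (5.177)
    gamma = np.sum(lam / (alpha + lam))    # effective parameters, eq. (5.179)
    alpha = gamma / (w_map @ w_map)        # re-estimation of alpha, eq. (5.178)
    beta = (N - gamma) / np.sum(resid**2)  # re-estimation of beta, eq. (5.180)

print(alpha, beta, gamma)
```

Since each eigenvalue $\lambda_i$ is positive here, every term of $\gamma$ lies in $(0,1)$, so $0 < \gamma < W$ and both hyperparameters remain positive throughout the iteration.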
The situation with a neural network model is more complex, however, due to the multimodality of the posterior distribution. As a consequence, the solution for $\mathbf{w}_{\mathrm{MAP}}$ found by maximizing the log posterior will depend on the initialization of $\mathbf{w}$. Solutions that differ only as a consequence of the interchange and sign-reversal symmetries in the hidden units are identical so far as predictions are concerned, and it is irrelevant which of the equivalent solutions is found. However, there may be inequivalent solutions as well, and these will generally yield different values for the optimized hyperparameters.

In order to compare different models, for example neural networks having different numbers of hidden units, we need to evaluate the model evidence $p(\mathcal{D})$. This can be approximated by taking (5.175) and substituting the values of $\alpha$ and $\beta$ obtained from the iterative optimization of these hyperparameters.
A more careful evaluation is obtained by marginalizing over $\alpha$ and $\beta$, again by making a Gaussian approximation (MacKay, 1992c; Bishop, 1995a). In either case, it is necessary to evaluate the determinant $|\mathbf{A}|$ of the Hessian matrix. This can be problematic in practice because the determinant, unlike the trace, is sensitive to the small eigenvalues that are often difficult to determine accurately.

The Laplace approximation is based on a local quadratic expansion around a mode of the posterior distribution over weights.
We have seen in Section 5.1.1 that any given mode in a two-layer network is a member of a set of $M!\,2^M$ equivalent modes that differ by interchange and sign-change symmetries, where $M$ is the number of hidden units. When comparing networks having different numbers of hidden units, this can be taken into account by multiplying the evidence by a factor of $M!\,2^M$.

5.7.3 Bayesian neural networks for classification

So far, we have used the Laplace approximation to develop a Bayesian treatment of neural network regression models.
We now discuss the modifications to this framework that arise when it is applied to classification. Here we shall consider a network having a single logistic-sigmoid output corresponding to a two-class classification problem. The extension to networks with multiclass softmax outputs is straightforward (Exercise 5.40). We shall build extensively on the analogous results for linear classification models discussed in Section 4.5, and so we encourage the reader to familiarize themselves with that material before studying this section.

The log likelihood function for this model is given by

$$
\ln p(\mathcal{D}\mid\mathbf{w}) = \sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} \tag{5.181}
$$

where $t_n\in\{0,1\}$ are the target values, and $y_n\equiv y(\mathbf{x}_n,\mathbf{w})$ (Exercise 5.41). Note that there is no hyperparameter $\beta$, because the data points are assumed to be correctly labelled.
As before, the prior is taken to be an isotropic Gaussian of the form (5.162).

The first stage in applying the Laplace framework to this model is to initialize the hyperparameter $\alpha$, and then to determine the parameter vector $\mathbf{w}$ by maximizing the log posterior distribution. This is equivalent to minimizing the regularized error function

$$
E(\mathbf{w}) = -\ln p(\mathcal{D}\mid\mathbf{w}) + \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \tag{5.182}
$$

and can be achieved using error backpropagation combined with standard optimization algorithms, as discussed in Section 5.3.

Having found a solution $\mathbf{w}_{\mathrm{MAP}}$ for the weight vector, the next step is to evaluate the Hessian matrix $\mathbf{H}$ comprising the second derivatives of the negative log likelihood function.
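As an illustration of minimizing (5.182), the sketch below uses plain gradient descent and, purely for brevity, a linear model with a logistic-sigmoid output in place of a two-layer network; the regularized objective and its gradient have the same form. All data, sizes, and names are stand-ins, not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-class data (stand-ins).
N, D = 100, 2
X = rng.normal(size=(N, D))
t = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

alpha = 0.1  # prior precision, assumed fixed at this stage

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reg_error(w):
    """Regularized error E(w) = -ln p(D|w) + (alpha/2) w^T w, eq. (5.182)."""
    y = sigmoid(X @ w)
    eps = 1e-12  # guards log(0)
    nll = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    return nll + 0.5 * alpha * (w @ w)

def grad(w):
    """Gradient of E(w); for this model backprop reduces to X^T (y - t)."""
    return X.T @ (sigmoid(X @ w) - t) + alpha * w

# Plain gradient descent stands in for the optimizers of Section 5.3.
w = np.zeros(D)
for _ in range(500):
    w -= 0.01 * grad(w)

print(reg_error(w))
```

For a genuine two-layer network the objective is non-convex, which is exactly the source of the multimodality discussed above; here the stand-in model is convex, so the minimizer is unique.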
This can be done, for instance, using the exact method of Section 5.4.5, or using the outer-product approximation given by (5.85). The second derivatives of the negative log posterior can again be written in the form (5.166), and the Gaussian approximation to the posterior is then given by (5.167).

To optimize the hyperparameter $\alpha$, we again maximize the marginal likelihood, which is easily shown to take the form

$$
\ln p(\mathcal{D}\mid\alpha) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2}\ln|\mathbf{A}| + \frac{W}{2}\ln\alpha + \text{const} \tag{5.183}
$$

where the regularized error function is defined by

$$
E(\mathbf{w}_{\mathrm{MAP}}) = -\sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} + \frac{\alpha}{2}\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}} \tag{5.184}
$$

in which $y_n\equiv y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}})$.
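A sketch of the $\alpha$-dependent terms of (5.183), assuming NumPy and an outer-product-style Hessian approximation for a logistic output, $\mathbf{H} \simeq \sum_n y_n(1-y_n)\mathbf{b}_n\mathbf{b}_n^{\mathrm{T}}$ with $\mathbf{b}_n = \nabla_{\mathbf{w}} a_n$. For brevity a linear activation $a_n = \mathbf{w}^{\mathrm{T}}\mathbf{x}_n$ stands in for the network, so $\mathbf{b}_n = \mathbf{x}_n$; all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in: linear "activations" a_n = w^T x_n, so grad_w a_n = x_n.
N, W = 50, 3
X = rng.normal(size=(N, W))
w_map = rng.normal(size=W)
alpha = 0.5

y = 1.0 / (1.0 + np.exp(-(X @ w_map)))   # y_n = sigma(a_n)

# Outer-product approximation to the Hessian of the negative log
# likelihood: H ~= sum_n y_n (1 - y_n) b_n b_n^T, with b_n = x_n here.
H = (X * (y * (1 - y))[:, None]).T @ X

# A takes the form (5.166) with the likelihood Hessian (no beta
# for classification): A = alpha*I + H.
A = alpha * np.eye(W) + H

# alpha-dependent evidence terms from (5.183):
# -(1/2) ln|A| + (W/2) ln(alpha); slogdet avoids overflow in |A|.
sign, logdetA = np.linalg.slogdet(A)
evidence_terms = -0.5 * logdetA + 0.5 * W * np.log(alpha)
print(evidence_terms)
```

Using `slogdet` rather than the raw determinant reflects the caution expressed earlier: $|\mathbf{A}|$ is sensitive to small eigenvalues, and working in log space is numerically safer.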
Maximizing this evidence function with respect to $\alpha$ again leads to the re-estimation equation given by (5.178).

The use of the evidence procedure to determine $\alpha$ is illustrated in Figure 5.22 for the synthetic two-dimensional data discussed in Appendix A.

Figure 5.22: Illustration of the evidence framework applied to a synthetic two-class data set. The green curve shows the optimal decision boundary, the black curve shows the result of fitting a two-layer network with 8 hidden units by maximum likelihood, and the red curve shows the result of including a regularizer in which $\alpha$ is optimized using the evidence procedure, starting from the initial value $\alpha = 0$. Note that the evidence procedure greatly reduces the over-fitting of the network.

Finally, we need the predictive distribution, which is defined by (5.168). Again, this integration is intractable due to the nonlinearity of the network function. The simplest approximation is to assume that the posterior distribution is very narrow and hence make the approximation

$$
p(t\mid\mathbf{x},\mathcal{D}) \simeq p(t\mid\mathbf{x},\mathbf{w}_{\mathrm{MAP}}). \tag{5.185}
$$

We can improve on this, however, by taking account of the variance of the posterior distribution.
In this case, a linear approximation for the network outputs, as was used in the case of regression, would be inappropriate due to the logistic-sigmoid output-unit activation function that constrains the output to lie in the range $(0,1)$. Instead, we make a linear approximation for the output-unit activation in the form

$$
a(\mathbf{x},\mathbf{w}) \simeq a_{\mathrm{MAP}}(\mathbf{x}) + \mathbf{b}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}) \tag{5.186}
$$

where $a_{\mathrm{MAP}}(\mathbf{x}) = a(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$, and the vector $\mathbf{b}\equiv\nabla a(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$ can be found by backpropagation.

Because we now have a Gaussian approximation for the posterior distribution over $\mathbf{w}$, and a model for $a$ that is a linear function of $\mathbf{w}$, we can now appeal to the results of Section 4.5.2. The distribution of output-unit activation values, induced by the distribution over network weights, is given by

$$
p(a\mid\mathbf{x},\mathcal{D}) = \int \delta\bigl(a - a_{\mathrm{MAP}}(\mathbf{x}) - \mathbf{b}^{\mathrm{T}}(\mathbf{x})(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}})\bigr)\, q(\mathbf{w}\mid\mathcal{D})\, \mathrm{d}\mathbf{w} \tag{5.187}
$$

where $q(\mathbf{w}\mid\mathcal{D})$ is the Gaussian approximation to the posterior distribution given by (5.167).
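Because the argument of the delta function in (5.187) is linear in $\mathbf{w}$ and $q(\mathbf{w}\mid\mathcal{D})$ is Gaussian, the induced distribution over $a$ is itself Gaussian, with mean $a_{\mathrm{MAP}}(\mathbf{x})$ and variance $\mathbf{b}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{b}$. A Monte Carlo sketch, assuming NumPy and using random stand-ins for every quantity (none of the values come from the text), illustrates this:

```python
import numpy as np

rng = np.random.default_rng(3)

W = 4
alpha, beta = 1.0, 5.0
M = rng.normal(size=(W, W))
H = M @ M.T                       # stand-in Hessian
A = alpha * np.eye(W) + beta * H  # form of eq. (5.166)
A_inv = np.linalg.inv(A)

w_map = rng.normal(size=W)        # stand-in MAP weights
b = rng.normal(size=W)            # stand-in b = grad_w a(x, w_MAP)
a_map = 0.7                       # stand-in a_MAP(x)

# Sample weights from q(w|D) = N(w | w_MAP, A^{-1}), eq. (5.167),
# and push them through the linearized activation, eq. (5.186).
ws = rng.multivariate_normal(w_map, A_inv, size=200_000)
a_samples = a_map + (ws - w_map) @ b

# The induced distribution (5.187) is Gaussian with mean a_MAP(x)
# and variance b^T A^{-1} b; compare sample moments against it.
var_exact = b @ A_inv @ b
print(a_samples.mean(), a_samples.var(), var_exact)
```

The sample mean and variance should match $a_{\mathrm{MAP}}(\mathbf{x})$ and $\mathbf{b}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{b}$ up to Monte Carlo error, mirroring the analytic marginalization used in Section 4.5.2.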