Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 51
Thus $y$ and $\eta$ must be related, and we denote this relation through $\eta = \psi(y)$.

Following Nelder and Wedderburn (1972), we define a generalized linear model to be one for which $y$ is a nonlinear function of a linear combination of the input (or feature) variables so that

$$ y = f(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) \tag{4.120} $$

where $f(\cdot)$ is known as the activation function in the machine learning literature, and $f^{-1}(\cdot)$ is known as the link function in statistics.

Now consider the log likelihood function for this model, which, as a function of $\eta$, is given by

$$ \ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \ln p(t_n|\eta, s) = \sum_{n=1}^{N} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} + \text{const} \tag{4.121} $$

where we are assuming that all observations share a common scale parameter (which corresponds to the noise variance for a Gaussian distribution, for instance) and so $s$ is independent of $n$. The derivative of the log likelihood with respect to the model parameters $\mathbf{w}$ is then given by

$$ \nabla_{\mathbf{w}} \ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \left\{ \frac{d}{d\eta_n} \ln g(\eta_n) + \frac{t_n}{s} \right\} \frac{d\eta_n}{dy_n} \frac{dy_n}{da_n} \nabla a_n = \sum_{n=1}^{N} \frac{1}{s} \{ t_n - y_n \} \psi'(y_n) f'(a_n) \boldsymbol{\phi}_n \tag{4.122} $$

where $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$, and we have used $y_n = f(a_n)$ together with the result (4.119) for $\mathbb{E}[t|\eta]$.
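The gradient (4.122) can be checked numerically. The following is a minimal sketch, assuming Bernoulli targets (for which $s = 1$ and $\psi(y) = \ln\{y/(1-y)\}$) and a probit activation $f(a)$, chosen here only because it is a non-canonical link so that the factors $\psi'(y_n) f'(a_n)$ do not cancel. The data, the weight vector, and all function names are hypothetical.

```python
import math

def probit(a):
    """Probit activation f(a): the standard normal CDF (an assumed choice)."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def probit_derivative(a):
    """f'(a): the standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def log_likelihood(w, features, targets):
    """Bernoulli log likelihood with y_n = f(w^T phi_n)."""
    total = 0.0
    for phi_n, t_n in zip(features, targets):
        y_n = probit(dot(w, phi_n))
        total += t_n * math.log(y_n) + (1.0 - t_n) * math.log(1.0 - y_n)
    return total

def gradient_from_4_122(w, features, targets, s=1.0):
    """Evaluate (1/s) * sum_n (t_n - y_n) psi'(y_n) f'(a_n) phi_n."""
    grad = [0.0] * len(w)
    for phi_n, t_n in zip(features, targets):
        a_n = dot(w, phi_n)
        y_n = probit(a_n)
        psi_prime = 1.0 / (y_n * (1.0 - y_n))  # psi(y) = ln(y / (1 - y))
        coeff = (t_n - y_n) * psi_prime * probit_derivative(a_n) / s
        for j, phi_nj in enumerate(phi_n):
            grad[j] += coeff * phi_nj
    return grad

def numerical_gradient(w, features, targets, eps=1e-6):
    """Central finite differences of the log likelihood."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = log_likelihood(w_plus, features, targets) - log_likelihood(w_minus, features, targets)
        grad.append(diff / (2.0 * eps))
    return grad

# Hypothetical toy data: a bias feature plus one input feature.
features = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3]]
targets = [1.0, 0.0, 1.0, 0.0]
w = [0.2, -0.4]

analytic = gradient_from_4_122(w, features, targets)
numeric = numerical_gradient(w, features, targets)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two gradients should agree to within finite-difference error, which is one way to confirm that the chain of derivatives in (4.122) has been assembled correctly.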
We now see that there is a considerable simplification if we choose a particular form for the link function $f^{-1}(y)$ given by

$$ f^{-1}(y) = \psi(y) \tag{4.123} $$

which gives $f(\psi(y)) = y$ and hence $f'(\psi)\psi'(y) = 1$. Also, because $a = f^{-1}(y)$, we have $a = \psi$ and hence $f'(a)\psi'(y) = 1$. In this case, the gradient of the error function reduces to

$$ \nabla E(\mathbf{w}) = \frac{1}{s} \sum_{n=1}^{N} \{ y_n - t_n \} \boldsymbol{\phi}_n. \tag{4.124} $$

For the Gaussian $s = \beta^{-1}$, whereas for the logistic model $s = 1$.

4.4. The Laplace Approximation

In Section 4.5 we shall discuss the Bayesian treatment of logistic regression. As we shall see, this is more complex than the Bayesian treatment of linear regression models, discussed in Sections 3.3 and 3.5. In particular, we cannot integrate exactly over the parameter vector $\mathbf{w}$ since the posterior distribution is no longer Gaussian. It is therefore necessary to introduce some form of approximation. Later in the book we shall consider a range of techniques based on analytical approximations (Chapter 10) and numerical sampling (Chapter 11).

Here we introduce a simple, but widely used, framework called the Laplace approximation, that aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Consider first the case of a single continuous variable $z$, and suppose the distribution $p(z)$ is defined by

$$ p(z) = \frac{1}{Z} f(z) \tag{4.125} $$

where $Z = \int f(z)\,dz$ is the normalization coefficient.
We shall suppose that the value of $Z$ is unknown. In the Laplace method the goal is to find a Gaussian approximation $q(z)$ which is centred on a mode of the distribution $p(z)$. The first step is to find a mode of $p(z)$, in other words a point $z_0$ such that $p'(z_0) = 0$, or equivalently

$$ \left. \frac{df(z)}{dz} \right|_{z=z_0} = 0. \tag{4.126} $$

A Gaussian distribution has the property that its logarithm is a quadratic function of the variables.
We therefore consider a Taylor expansion of $\ln f(z)$ centred on the mode $z_0$ so that

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 \tag{4.127} $$

where

$$ A = - \left. \frac{d^2}{dz^2} \ln f(z) \right|_{z=z_0}. \tag{4.128} $$

Note that the first-order term in the Taylor expansion does not appear since $z_0$ is a local maximum of the distribution. Taking the exponential we obtain

$$ f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. \tag{4.129} $$

We can then obtain a normalized distribution $q(z)$ by making use of the standard result for the normalization of a Gaussian, so that

$$ q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. \tag{4.130} $$

The Laplace approximation is illustrated in Figure 4.14.
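The steps (4.126) to (4.130) can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: it takes $f(z) = \exp(-z^2/2)\,\sigma(20z + 4)$, the unnormalized density used in Figure 4.14, and finds the mode with a plain Newton iteration (one possible optimizer among many). The function names and the starting point $z = 0$ are assumptions.

```python
import math

def sigmoid(u):
    """Numerically stable logistic sigmoid."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def dlog_f(z):
    """First derivative of ln f(z) = -z^2/2 + ln sigmoid(20 z + 4)."""
    return -z + 20.0 * (1.0 - sigmoid(20.0 * z + 4.0))

def d2log_f(z):
    """Second derivative of ln f(z)."""
    s = sigmoid(20.0 * z + 4.0)
    return -1.0 - 400.0 * s * (1.0 - s)

def find_mode(z=0.0, tol=1e-12, max_iter=100):
    """Newton iteration on the stationarity condition d/dz ln f(z) = 0."""
    for _ in range(max_iter):
        step = dlog_f(z) / d2log_f(z)
        z -= step
        if abs(step) < tol:
            break
    return z

z0 = find_mode()      # mode of p(z), cf. (4.126)
A = -d2log_f(z0)      # precision of the approximation, cf. (4.128)

def q(z):
    """Laplace approximation (4.130): a Gaussian with mean z0 and precision A."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

print("mode z0 =", z0)
print("precision A =", A)
print("q at the mode =", q(z0))
```

Because $f(z) > 0$, stationary points of $\ln f$ and of $f$ coincide, which is why the iteration works with the derivatives of $\ln f$. For this particular $f$, $\ln f$ is concave, so the computed $A$ is positive.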
Note that the Gaussian approximation will only be well defined if its precision $A > 0$, in other words the stationary point $z_0$ must be a local maximum, so that the second derivative of $f(z)$ at the point $z_0$ is negative.

Figure 4.14 Illustration of the Laplace approximation applied to the distribution $p(z) \propto \exp(-z^2/2)\,\sigma(20z + 4)$ where $\sigma(z)$ is the logistic sigmoid function defined by $\sigma(z) = (1 + e^{-z})^{-1}$. The left plot shows the normalized distribution $p(z)$ in yellow, together with the Laplace approximation centred on the mode $z_0$ of $p(z)$ in red. The right plot shows the negative logarithms of the corresponding curves.

We can extend the Laplace method to approximate a distribution $p(\mathbf{z}) = f(\mathbf{z})/Z$ defined over an $M$-dimensional space $\mathbf{z}$.
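At a stationary point $\mathbf{z}_0$ the gradient $\nabla f(\mathbf{z})$ will vanish. Expanding around this stationary point we have

$$ \ln f(\mathbf{z}) \simeq \ln f(\mathbf{z}_0) - \frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \tag{4.131} $$

where the $M \times M$ Hessian matrix $\mathbf{A}$ is defined by

$$ \mathbf{A} = - \left. \nabla\nabla \ln f(\mathbf{z}) \right|_{\mathbf{z}=\mathbf{z}_0} \tag{4.132} $$

and $\nabla$ is the gradient operator. Taking the exponential of both sides we obtain

$$ f(\mathbf{z}) \simeq f(\mathbf{z}_0) \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\}. \tag{4.133} $$

The distribution $q(\mathbf{z})$ is proportional to $f(\mathbf{z})$ and the appropriate normalization coefficient can be found by inspection, using the standard result (2.43) for a normalized multivariate Gaussian, giving

$$ q(\mathbf{z}) = \frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\} = \mathcal{N}(\mathbf{z}|\mathbf{z}_0, \mathbf{A}^{-1}) \tag{4.134} $$

where $|\mathbf{A}|$ denotes the determinant of $\mathbf{A}$. This Gaussian distribution will be well defined provided its precision matrix, given by $\mathbf{A}$, is positive definite, which implies that the stationary point $\mathbf{z}_0$ must be a local maximum, not a minimum or a saddle point.

In order to apply the Laplace approximation we first need to find the mode $\mathbf{z}_0$, and then evaluate the Hessian matrix at that mode.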
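The two-step recipe (find the mode, then evaluate the Hessian there) can be sketched for $M = 2$. The example density $f(\mathbf{z}) = \exp(-\mathbf{z}^{\mathrm{T}}\mathbf{z}/2)\,\sigma(\mathbf{w}^{\mathrm{T}}\mathbf{z} + b)$ is a hypothetical two-dimensional analogue of the one in Figure 4.14; the values of `W` and `B`, the use of Newton's method, and all helper names are assumptions made for illustration.

```python
import math

W = (3.0, 1.0)  # hypothetical direction of the sigmoid factor
B = 1.0         # hypothetical offset
M = 2           # dimensionality of z

def sigmoid(u):
    """Numerically stable logistic sigmoid."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def grad_log_f(z):
    """Gradient of ln f(z) = -z.z/2 + ln sigmoid(W.z + B)."""
    r = 1.0 - sigmoid(W[0] * z[0] + W[1] * z[1] + B)
    return (-z[0] + r * W[0], -z[1] + r * W[1])

def precision_matrix(z):
    """A = -grad grad ln f(z) = I + s (1 - s) W W^T, cf. (4.132)."""
    s = sigmoid(W[0] * z[0] + W[1] * z[1] + B)
    c = s * (1.0 - s)
    return ((1.0 + c * W[0] * W[0], c * W[0] * W[1]),
            (c * W[1] * W[0], 1.0 + c * W[1] * W[1]))

def solve_2x2(A, b):
    """Solve A x = b for a 2 x 2 matrix using the explicit inverse."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return ((A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det)

def find_mode(z=(0.0, 0.0), tol=1e-12, max_iter=100):
    """Newton ascent on ln f: z <- z + A^{-1} grad ln f(z)."""
    for _ in range(max_iter):
        step = solve_2x2(precision_matrix(z), grad_log_f(z))
        z = (z[0] + step[0], z[1] + step[1])
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return z

z0 = find_mode()
A = precision_matrix(z0)
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
# For a symmetric 2 x 2 matrix, positive definiteness is equivalent to
# a positive leading entry and a positive determinant.
is_positive_definite = A[0][0] > 0.0 and det_A > 0.0

def q(z):
    """Laplace approximation (4.134): N(z | z0, A^{-1})."""
    d0, d1 = z[0] - z0[0], z[1] - z0[1]
    quad = d0 * (A[0][0] * d0 + A[0][1] * d1) + d1 * (A[1][0] * d0 + A[1][1] * d1)
    return math.sqrt(det_A) / (2.0 * math.pi) ** (M / 2.0) * math.exp(-0.5 * quad)

print("mode:", z0)
print("precision matrix:", A)
print("positive definite:", is_positive_definite)
```

The explicit check on `is_positive_definite` mirrors the condition stated above: without it, a stationary point that is a saddle or a minimum would silently produce an invalid Gaussian.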
In practice a mode will typically be found by running some form of numerical optimization algorithm (Bishop and Nabney, 2008). Many of the distributions encountered in practice will be multimodal and so there will be different Laplace approximations according to which mode is being considered. Note that the normalization constant $Z$ of the true distribution does not need to be known in order to apply the Laplace method. As a result of the central limit theorem, the posterior distribution for a model is expected to become increasingly better approximated by a Gaussian as the number of observed data points is increased, and so we would expect the Laplace approximation to be most useful in situations where the number of data points is relatively large.

One major weakness of the Laplace approximation is that, since it is based on a Gaussian distribution, it is only directly applicable to real variables.
In other cases it may be possible to apply the Laplace approximation to a transformation of the variable. For instance if $0 \leqslant \tau < \infty$ then we can consider a Laplace approximation of $\ln \tau$. The most serious limitation of the Laplace framework, however, is that it is based purely on the aspects of the true distribution at a specific value of the variable, and so can fail to capture important global properties.
In Chapter 10 we shall consider alternative approaches which adopt a more global perspective.

4.4.1 Model comparison and BIC

As well as approximating the distribution $p(\mathbf{z})$ we can also obtain an approximation to the normalization constant $Z$. Using the approximation (4.133) we have

$$ Z = \int f(\mathbf{z})\,d\mathbf{z} \simeq f(\mathbf{z}_0) \int \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\} d\mathbf{z} = f(\mathbf{z}_0) \frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}} \tag{4.135} $$

where we have noted that the integrand is Gaussian and made use of the standard result (2.43) for a normalized Gaussian distribution.
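To get a feel for the quality of (4.135), the sketch below compares it with brute-force quadrature in one dimension ($M = 1$), again using $f(z) = \exp(-z^2/2)\,\sigma(20z + 4)$ from Figure 4.14. The trapezoid rule, the integration range $[-10, 10]$, and the helper names are assumptions chosen for simplicity.

```python
import math

def sigmoid(u):
    """Numerically stable logistic sigmoid."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def f(z):
    """Unnormalized density from Figure 4.14."""
    return math.exp(-0.5 * z * z) * sigmoid(20.0 * z + 4.0)

def dlog_f(z):
    return -z + 20.0 * (1.0 - sigmoid(20.0 * z + 4.0))

def d2log_f(z):
    s = sigmoid(20.0 * z + 4.0)
    return -1.0 - 400.0 * s * (1.0 - s)

# Locate the mode with Newton's method, starting from z = 0.
z0 = 0.0
for _ in range(100):
    step = dlog_f(z0) / d2log_f(z0)
    z0 -= step
    if abs(step) < 1e-12:
        break

A = -d2log_f(z0)
Z_laplace = f(z0) * math.sqrt(2.0 * math.pi / A)  # (4.135) with M = 1

def trapezoid(func, lo, hi, n):
    """Composite trapezoid rule with n equal sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, n):
        total += func(lo + i * h)
    return total * h

Z_numeric = trapezoid(f, -10.0, 10.0, 20000)
rel_err = abs(Z_laplace - Z_numeric) / Z_numeric

print("Z (Laplace)    =", Z_laplace)
print("Z (quadrature) =", Z_numeric)
print("relative error =", rel_err)
```

The density here is visibly skewed, so the two estimates are not expected to coincide; `rel_err` gives a rough measure of what the purely local Gaussian fit misses.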
We can use the result (4.135) to obtain an approximation to the model evidence which, as discussed in Section 3.4, plays a central role in Bayesian model comparison.

Consider a data set $\mathcal{D}$ and a set of models $\{\mathcal{M}_i\}$ having parameters $\{\boldsymbol{\theta}_i\}$. For each model we define a likelihood function $p(\mathcal{D}|\boldsymbol{\theta}_i, \mathcal{M}_i)$. If we introduce a prior $p(\boldsymbol{\theta}_i|\mathcal{M}_i)$ over the parameters, then we are interested in computing the model evidence $p(\mathcal{D}|\mathcal{M}_i)$ for the various models.
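From now on we omit the conditioning on $\mathcal{M}_i$ to keep the notation uncluttered. From Bayes’ theorem the model evidence is given by

$$ p(\mathcal{D}) = \int p(\mathcal{D}|\boldsymbol{\theta}) p(\boldsymbol{\theta})\,d\boldsymbol{\theta}. \tag{4.136} $$

Identifying $f(\boldsymbol{\theta}) = p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})$ and $Z = p(\mathcal{D})$, and applying the result (4.135), we obtain (Exercise 4.22)

$$ \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) + \underbrace{\ln p(\boldsymbol{\theta}_{\mathrm{MAP}}) + \frac{M}{2} \ln(2\pi) - \frac{1}{2} \ln |\mathbf{A}|}_{\text{Occam factor}} \tag{4.137} $$

where $\boldsymbol{\theta}_{\mathrm{MAP}}$ is the value of $\boldsymbol{\theta}$ at the mode of the posterior distribution, and $\mathbf{A}$ is the Hessian matrix of second derivatives of the negative log posterior

$$ \mathbf{A} = -\nabla\nabla \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) p(\boldsymbol{\theta}_{\mathrm{MAP}}) = -\nabla\nabla \ln p(\boldsymbol{\theta}_{\mathrm{MAP}}|\mathcal{D}). \tag{4.138} $$

The first term on the right hand side of (4.137) represents the log likelihood evaluated using the optimized parameters, while the remaining three terms comprise the ‘Occam factor’ which penalizes model complexity.

If we assume that the Gaussian prior distribution over parameters is broad, and that the Hessian has full rank (Section 3.5.3), then we can approximate (4.137) very roughly using (Exercise 4.23)

$$ \ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2} M \ln N \tag{4.139} $$

where $N$ is the number of data points, $M$ is the number of parameters in $\boldsymbol{\theta}$ and we have omitted additive constants.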
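The relationship between (4.137) and (4.139) can be seen on a toy model where the evidence is also available by direct integration. The sketch below assumes a single parameter $\theta$ ($M = 1$): the unknown mean of Gaussian observations with known noise level, under a broad zero-mean Gaussian prior. Because the posterior is then exactly Gaussian, the Laplace estimate should match numerical integration, while the rougher estimate (4.139) differs by the additive terms it drops. The data set, `SIGMA`, `SIGMA0`, and the function names are all hypothetical.

```python
import math

SIGMA = 1.0    # known noise standard deviation (assumed)
SIGMA0 = 10.0  # standard deviation of the broad Gaussian prior (assumed)
data = [0.5 + 0.1 * ((7 * i) % 11 - 5) for i in range(20)]  # synthetic observations
N = len(data)
M = 1          # number of parameters

def log_likelihood(theta):
    """ln p(D | theta) for independent Gaussian observations with mean theta."""
    sq = sum((x - theta) ** 2 for x in data)
    return -0.5 * N * math.log(2.0 * math.pi * SIGMA ** 2) - 0.5 * sq / SIGMA ** 2

def log_prior(theta):
    """ln p(theta) for a zero-mean Gaussian prior."""
    return -0.5 * math.log(2.0 * math.pi * SIGMA0 ** 2) - 0.5 * theta ** 2 / SIGMA0 ** 2

# For this model the negative log posterior is quadratic, so the Hessian A
# and the MAP estimate are available in closed form.
A = N / SIGMA ** 2 + 1.0 / SIGMA0 ** 2
theta_map = (sum(data) / SIGMA ** 2) / A

# Laplace estimate of the log evidence, (4.137).
log_evidence_laplace = (log_likelihood(theta_map) + log_prior(theta_map)
                        + 0.5 * M * math.log(2.0 * math.pi) - 0.5 * math.log(A))

# Rough estimate (4.139), which omits additive constants.
log_evidence_bic = log_likelihood(theta_map) - 0.5 * M * math.log(N)

def log_evidence_by_quadrature(half_width=5.0, n=20000):
    """Trapezoid-rule estimate of ln of the integral in (4.136)."""
    offset = log_likelihood(theta_map) + log_prior(theta_map)  # guards against underflow
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        theta = theta_map - half_width + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(log_likelihood(theta) + log_prior(theta) - offset)
    return offset + math.log(total * h)

log_evidence_numeric = log_evidence_by_quadrature()

print("quadrature:", log_evidence_numeric)
print("Laplace   :", log_evidence_laplace)
print("BIC       :", log_evidence_bic)
```

In this setting the gap between the Laplace and BIC values is dominated by the log prior density at $\boldsymbol{\theta}_{\mathrm{MAP}}$, which is one of the additive terms that (4.139) discards.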
The approximation (4.139) is known as the Bayesian Information Criterion (BIC) or the Schwarz criterion (Schwarz, 1978). Note that, compared to AIC given by (1.73), this penalizes model complexity more heavily.

Complexity measures such as AIC and BIC have the virtue of being easy to evaluate, but can also give misleading results. In particular, the assumption that the Hessian matrix has full rank is often not valid since many of the parameters are not ‘well-determined’. We can use the result (4.137) to obtain a more accurate estimate of the model evidence starting from the Laplace approximation, as we illustrate in the context of neural networks in Section 5.7.

4.5. Bayesian Logistic Regression

We now turn to a Bayesian treatment of logistic regression. Exact Bayesian inference for logistic regression is intractable.