Bishop C.M., Pattern Recognition and Machine Learning (2006)
In particular, evaluation of the posterior distribution would require normalization of the product of a prior distribution and a likelihood function that itself comprises a product of logistic sigmoid functions, one for every data point. Evaluation of the predictive distribution is similarly intractable. Here we consider the application of the Laplace approximation to the problem of Bayesian logistic regression (Spiegelhalter and Lauritzen, 1990; MacKay, 1992b).

4.5.1 Laplace approximation

Recall from Section 4.4 that the Laplace approximation is obtained by finding the mode of the posterior distribution and then fitting a Gaussian centred at that mode.
This requires evaluation of the second derivatives of the log posterior, which is equivalent to finding the Hessian matrix.

Because we seek a Gaussian representation for the posterior distribution, it is natural to begin with a Gaussian prior, which we write in the general form

    p(w) = N(w | m_0, S_0)    (4.140)

where m_0 and S_0 are fixed hyperparameters.
The posterior distribution over w is given by

    p(w|t) ∝ p(w) p(t|w)    (4.141)

where t = (t_1, ..., t_N)^T. Taking the log of both sides, and substituting for the prior distribution using (4.140), and for the likelihood function using (4.89), we obtain

    ln p(w|t) = -(1/2)(w - m_0)^T S_0^{-1} (w - m_0)
                + Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) } + const    (4.142)

where y_n = σ(w^T φ_n). To obtain a Gaussian approximation to the posterior distribution, we first maximize the posterior distribution to give the MAP (maximum posterior) solution w_MAP, which defines the mean of the Gaussian. The covariance is then given by the inverse of the matrix of second derivatives of the negative log likelihood, which takes the form

    S_N^{-1} = -∇∇ ln p(w|t) = S_0^{-1} + Σ_{n=1}^{N} y_n (1 - y_n) φ_n φ_n^T.    (4.143)

The Gaussian approximation to the posterior distribution therefore takes the form

    q(w) = N(w | w_MAP, S_N).    (4.144)

Having obtained a Gaussian approximation to the posterior distribution, there remains the task of marginalizing with respect to this distribution in order to make predictions.

4.5.2 Predictive distribution

The predictive distribution for class C_1, given a new feature vector φ(x), is obtained by marginalizing with respect to the posterior distribution p(w|t), which is itself approximated by a Gaussian distribution q(w) so that

    p(C_1|φ, t) = ∫ p(C_1|φ, w) p(w|t) dw ≃ ∫ σ(w^T φ) q(w) dw    (4.145)

with the corresponding probability for class C_2 given by p(C_2|φ, t) = 1 - p(C_1|φ, t). To evaluate the predictive distribution, we first note that the function σ(w^T φ) depends on w only through its projection onto φ.
Denoting a = w^T φ, we have

    σ(w^T φ) = ∫ δ(a - w^T φ) σ(a) da    (4.146)

where δ(·) is the Dirac delta function. From this we obtain

    ∫ σ(w^T φ) q(w) dw = ∫ σ(a) p(a) da    (4.147)

where

    p(a) = ∫ δ(a - w^T φ) q(w) dw.    (4.148)

We can evaluate p(a) by noting that the delta function imposes a linear constraint on w and so forms a marginal distribution from the joint distribution q(w) by integrating out all directions orthogonal to φ. Because q(w) is Gaussian, we know from Section 2.3.2 that the marginal distribution will also be Gaussian. We can evaluate the mean and covariance of this distribution by taking moments, and interchanging the order of integration over a and w, so that

    μ_a = E[a] = ∫ p(a) a da = ∫ q(w) w^T φ dw = w_MAP^T φ    (4.149)

where we have used the result (4.144) for the variational posterior distribution q(w). Similarly

    σ_a^2 = var[a] = ∫ p(a) { a^2 - E[a]^2 } da
          = ∫ q(w) { (w^T φ)^2 - (w_MAP^T φ)^2 } dw = φ^T S_N φ.    (4.150)

Note that the distribution of a takes the same form as the predictive distribution (3.58) for the linear regression model, with the noise variance set to zero.
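To make the construction concrete, here is a minimal NumPy sketch (not code from the book; the dataset, prior parameters, and function names are illustrative) that finds w_MAP by Newton's method applied to the log posterior (4.142), forms S_N via (4.143), and then evaluates μ_a and σ_a^2 from (4.149) and (4.150) for a test point.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logistic(Phi, t, m0, S0, n_iters=100):
    """Laplace approximation q(w) = N(w | w_MAP, S_N) for Bayesian logistic regression.

    Phi    : (N, M) design matrix of basis-function values phi_n
    t      : (N,) binary targets in {0, 1}
    m0, S0 : Gaussian prior mean (M,) and covariance (M, M)
    """
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        # Gradient and Hessian of the negative log posterior (4.142)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])   # S_N^{-1}, eq. (4.143)
        step = np.linalg.solve(H, grad)
        w = w - step                                          # Newton update
        if np.max(np.abs(step)) < 1e-8:
            break
    S_N = np.linalg.inv(H)
    return w, S_N

def latent_moments(phi, w_map, S_N):
    """Mean and variance of a = w^T phi under q(w): eqs (4.149) and (4.150)."""
    mu_a = w_map @ phi
    var_a = phi @ S_N @ phi
    return mu_a, var_a

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, M = 100, 3
    Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # bias + 2 features
    w_true = np.array([-0.5, 2.0, -1.0])
    t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)

    m0 = np.zeros(M)
    S0 = 10.0 * np.eye(M)                  # broad Gaussian prior (illustrative choice)
    w_map, S_N = laplace_logistic(Phi, t, m0, S0)
    mu_a, var_a = latent_moments(np.array([1.0, 0.3, -0.2]), w_map, S_N)
    print(w_map, mu_a, var_a)
```

Because the Gaussian prior makes the negative log posterior strictly convex, Newton's method converges to the unique mode. With μ_a and σ_a^2 in hand, all that remains is a one-dimensional integral over a.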
Thus our variational approximation to the predictive distribution becomes

    p(C_1|t) = ∫ σ(a) p(a) da = ∫ σ(a) N(a | μ_a, σ_a^2) da.    (4.151)

This result can also be derived directly by making use of the results for the marginal of a Gaussian distribution given in Section 2.3.2 (see Exercises 4.24-4.26).

The integral over a represents the convolution of a Gaussian with a logistic sigmoid, and cannot be evaluated analytically.
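Although no closed form exists, the integral is one-dimensional and is easy to evaluate numerically; the following sketch (again hypothetical, taking μ_a and σ_a^2 as inputs) uses Gauss-Hermite quadrature from NumPy.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_quadrature(mu_a, var_a, degree=50):
    """Numerically evaluate the integral of sigma(a) N(a | mu_a, var_a) over a, eq. (4.151),
    by Gauss-Hermite quadrature with the substitution a = mu_a + sqrt(2 var_a) x."""
    x, w = np.polynomial.hermite.hermgauss(degree)
    a = mu_a + np.sqrt(2.0 * var_a) * x
    return np.sum(w * sigmoid(a)) / np.sqrt(np.pi)

if __name__ == "__main__":
    # Illustrative values only, e.g. mu_a = 1.0 and sigma_a^2 = 4.0
    print(predictive_quadrature(1.0, 4.0))
```

Such a quadrature estimate is a useful reference against which the analytic approximation developed next can be checked.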
We can, however, obtain a good approximation (Spiegelhalter and Lauritzen, 1990; MacKay, 1992b; Barber and Bishop, 1998a) by making use of the close similarity between the logistic sigmoid function σ(a) defined by (4.59) and the probit function Φ(a) defined by (4.114). In order to obtain the best approximation to the logistic function we need to re-scale the horizontal axis, so that we approximate σ(a) by Φ(λa). We can find a suitable value of λ by requiring that the two functions have the same slope at the origin: the sigmoid has slope σ(0){1 - σ(0)} = 1/4 there, while Φ(λa) has slope λ/(2π)^{1/2}, so that λ = (2π)^{1/2}/4 and hence λ^2 = π/8. The similarity of the logistic sigmoid and the probit function, for this choice of λ, is illustrated in Figure 4.9.

The advantage of using a probit function is that its convolution with a Gaussian can be expressed analytically in terms of another probit function.
Specifically we can show that

    ∫ Φ(λa) N(a | μ, σ^2) da = Φ( μ / (λ^{-2} + σ^2)^{1/2} ).    (4.152)

We now apply the approximation σ(a) ≃ Φ(λa) to the probit functions appearing on both sides of this equation, leading to the following approximation for the convolution of a logistic sigmoid with a Gaussian

    ∫ σ(a) N(a | μ, σ^2) da ≃ σ( κ(σ^2) μ )    (4.153)

where we have defined

    κ(σ^2) = (1 + π σ^2 / 8)^{-1/2}.    (4.154)

Applying this result to (4.151) we obtain the approximate predictive distribution in the form

    p(C_1|φ, t) = σ( κ(σ_a^2) μ_a )    (4.155)

where μ_a and σ_a^2 are defined by (4.149) and (4.150), respectively, and κ(σ_a^2) is defined by (4.154).

Note that the decision boundary corresponding to p(C_1|φ, t) = 0.5 is given by μ_a = 0, which is the same as the decision boundary obtained by using the MAP value for w. Thus if the decision criterion is based on minimizing misclassification rate, with equal prior probabilities, then the marginalization over w has no effect.
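The snippet below (an illustrative sketch, with hypothetical values of μ_a and σ_a^2) applies (4.153)-(4.155) and compares the result with the MAP plug-in prediction σ(μ_a): the posterior uncertainty pulls the predictive probability towards 0.5, while the sign of μ_a, and hence the decision at threshold 0.5, is unchanged.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kappa(var_a):
    """kappa(sigma^2) = (1 + pi sigma^2 / 8)^{-1/2}, eq. (4.154)."""
    return 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)

def predictive_probit_approx(mu_a, var_a):
    """Approximate predictive probability p(C1 | phi, t) of eq. (4.155)."""
    return sigmoid(kappa(var_a) * mu_a)

# Larger posterior uncertainty moderates the probability but leaves the
# sign of mu_a, and hence the 0.5-threshold decision, unchanged.
for mu_a, var_a in [(2.0, 0.0), (2.0, 4.0), (-1.0, 0.5)]:
    print(mu_a, var_a,
          sigmoid(mu_a),                          # MAP plug-in prediction
          predictive_probit_approx(mu_a, var_a))  # marginalized prediction
```

The values produced this way can also be checked against the quadrature sketch given earlier, with which they agree closely.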
However, for more complex decision criteria it will play an important role. Marginalization of the logistic sigmoid model under a Gaussian approximation to the posterior distribution will be illustrated in the context of variational inference in Figure 10.13.

Exercises

4.1 ( ) Given a set of data points {x_n}, we can define the convex hull to be the set of all points x given by

    x = Σ_n α_n x_n    (4.156)

where α_n ≥ 0 and Σ_n α_n = 1.
Consider a second set of points {y_n} together with their corresponding convex hull. By definition, the two sets of points will be linearly separable if there exists a vector ŵ and a scalar w_0 such that ŵ^T x_n + w_0 > 0 for all x_n, and ŵ^T y_n + w_0 < 0 for all y_n. Show that if their convex hulls intersect, the two sets of points cannot be linearly separable, and conversely that if they are linearly separable, their convex hulls do not intersect.

4.2 ( ) www Consider the minimization of a sum-of-squares error function (4.15), and suppose that all of the target vectors in the training set satisfy a linear constraint

    a^T t_n + b = 0    (4.157)

where t_n corresponds to the nth row of the matrix T in (4.15). Show that as a consequence of this constraint, the elements of the model prediction y(x) given by the least-squares solution (4.17) also satisfy this constraint, so that

    a^T y(x) + b = 0.    (4.158)

To do so, assume that one of the basis functions φ_0(x) = 1 so that the corresponding parameter w_0 plays the role of a bias.

4.3 ( ) Extend the result of Exercise 4.2 to show that if multiple linear constraints are satisfied simultaneously by the target vectors, then the same constraints will also be satisfied by the least-squares prediction of a linear model.

4.4 ( ) www Show that maximization of the class separation criterion given by (4.23) with respect to w, using a Lagrange multiplier to enforce the constraint w^T w = 1, leads to the result that w ∝ (m_2 - m_1).

4.5 ( ) By making use of (4.20), (4.23), and (4.24), show that the Fisher criterion (4.25) can be written in the form (4.26).

4.6 ( ) Using the definitions of the between-class and within-class covariance matrices given by (4.27) and (4.28), respectively, together with (4.34) and (4.36) and the choice of target values described in Section 4.1.5, show that the expression (4.33) that minimizes the sum-of-squares error function can be written in the form (4.37).

4.7 ( ) www Show that the logistic sigmoid function (4.59) satisfies the property σ(-a) = 1 - σ(a) and that its inverse is given by σ^{-1}(y) = ln{y/(1 - y)}.

4.8 ( ) Using (4.57) and (4.58), derive the result (4.65) for the posterior class probability in the two-class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for the parameters w and w_0.

4.9 ( ) www Consider a generative classification model for K classes defined by prior class probabilities p(C_k) = π_k and general class-conditional densities p(φ|C_k) where φ is the input feature vector.