We cannot simply find the mode by setting this gradient to zero, because σ_N depends nonlinearly on a_N, and so we resort to an iterative scheme based on the Newton-Raphson method, which gives rise to an iterative reweighted least squares (IRLS) algorithm. This requires the second derivatives of Ψ(a_N), which we also require for the Laplace approximation anyway, and which are given by

∇∇Ψ(a_N) = −W_N − C_N^{-1}    (6.82)

where W_N is a diagonal matrix with elements σ(a_n)(1 − σ(a_n)), and we have used the result (4.88) for the derivative of the logistic sigmoid function.
Note that these diagonal elements lie in the range (0, 1/4), and hence W_N is a positive definite matrix. Because C_N (and hence its inverse) is positive definite by construction, and because the sum of two positive definite matrices is also positive definite (Exercise 6.24), we see that the Hessian matrix A = −∇∇Ψ(a_N) is positive definite, and so the posterior distribution p(a_N|t_N) is log concave and therefore has a single mode that is the global maximum.
The posterior distribution is not Gaussian, however, because the Hessian is a function of a_N.

Using the Newton-Raphson formula (4.92) (Exercise 6.25), the iterative update equation for a_N is given by

a_N^new = C_N (I + W_N C_N)^{-1} {t_N − σ_N + W_N a_N}.    (6.83)

These equations are iterated until they converge to the mode, which we denote by a_N^⋆. At the mode, the gradient ∇Ψ(a_N) will vanish, and hence a_N^⋆ will satisfy

a_N^⋆ = C_N (t_N − σ_N).    (6.84)

Once we have found the mode a_N^⋆ of the posterior, we can evaluate the Hessian matrix given by

H = −∇∇Ψ(a_N) = W_N + C_N^{-1}    (6.85)

where the elements of W_N are evaluated using a_N^⋆.
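As a concrete illustration (not part of the original text), the following minimal NumPy sketch iterates the update (6.83) to locate the mode and evaluates the Hessian (6.85). The names find_mode, C (the Gram matrix C_N) and t (the 0/1 target vector t_N) are illustrative assumptions; a production implementation would use Cholesky factorizations rather than explicit inverses for numerical stability.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def find_mode(C, t, max_iter=100, tol=1e-8):
    """Locate the mode a_N* of p(a_N | t_N) by iterating the Newton-Raphson
    update (6.83); C is the N x N Gram matrix C_N, t the 0/1 targets t_N."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    a = np.zeros(N)                                   # start from a_N = 0
    for _ in range(max_iter):
        sigma = sigmoid(a)
        W = np.diag(sigma * (1.0 - sigma))            # W_N
        # a_new = C_N (I + W_N C_N)^{-1} {t_N - sigma_N + W_N a_N}
        a_new = C @ np.linalg.solve(np.eye(N) + W @ C, t - sigma + W @ a)
        converged = np.max(np.abs(a_new - a)) < tol
        a = a_new
        if converged:
            break
    sigma = sigmoid(a)
    W = np.diag(sigma * (1.0 - sigma))
    H = W + np.linalg.inv(C)                          # Hessian (6.85) at the mode
    return a, H
```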
This defines our Gaussian approximation to the posterior distribution p(a_N|t_N), given by

q(a_N) = N(a_N | a_N^⋆, H^{-1}).    (6.86)

We can now combine this with (6.78) and hence evaluate the integral (6.77). Because this corresponds to a linear-Gaussian model, we can use the general result (2.115) (Exercise 6.26) to give

E[a_{N+1}|t_N] = k^T (t_N − σ_N)    (6.87)
var[a_{N+1}|t_N] = c − k^T (W_N^{-1} + C_N)^{-1} k.    (6.88)

Now that we have a Gaussian distribution for p(a_{N+1}|t_N), we can approximate the integral (6.76) using the result (4.153).
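Continuing the illustrative sketch above (again an added example under assumed names, not text from the book), the predictive mean (6.87), variance (6.88) and the class probability obtained from the probit-based approximation (4.153) can be computed as follows. Here k holds the covariances k(x_n, x_{N+1}) between the training inputs and the test input, c is the prior variance of a_{N+1}, and a_star is the mode returned by find_mode.

```python
def predict_proba(C, t, a_star, k, c):
    """Approximate p(t_{N+1} = 1 | t_N) for a single test point, using the
    predictive mean (6.87), variance (6.88) and the approximation (4.153)
    to the convolution of a sigmoid with a Gaussian."""
    sigma = sigmoid(a_star)
    W_inv = np.diag(1.0 / (sigma * (1.0 - sigma)))        # W_N^{-1}
    mean = k @ (t - sigma)                                 # (6.87)
    var = c - k @ np.linalg.solve(W_inv + C, k)            # (6.88)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)         # (4.153)
    return sigmoid(kappa * mean)
```

Note that sigmoid(kappa * mean) equals 0.5 exactly when the mean is zero, which is why only the mean matters for the decision boundary discussed next.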
As with the Bayesian logistic regression model of Section 4.5, if we are only interested in the decision boundary corresponding to p(t_{N+1}|t_N) = 0.5, then we need only consider the mean and we can ignore the effect of the variance.

We also need to determine the parameters θ of the covariance function. One approach is to maximize the likelihood function given by p(t_N|θ), for which we need expressions for the log likelihood and its gradient.
If desired, suitable regularization terms can also be added, leading to a penalized maximum likelihood solution. The likelihood function is defined by

p(t_N|θ) = ∫ p(t_N|a_N) p(a_N|θ) da_N.    (6.89)

This integral is analytically intractable, so again we make use of the Laplace approximation. Using the result (4.135), we obtain the following approximation for the log of the likelihood function

ln p(t_N|θ) = Ψ(a_N^⋆) − (1/2) ln|W_N + C_N^{-1}| + (N/2) ln(2π)    (6.90)

where Ψ(a_N^⋆) = ln p(a_N^⋆|θ) + ln p(t_N|a_N^⋆). We also need to evaluate the gradient of ln p(t_N|θ) with respect to the parameter vector θ.
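A minimal sketch of evaluating the approximation (6.90) at the mode, reusing find_mode from above. The helper kernel(X, X, theta) is an assumption: it is taken to return the Gram matrix C_N for covariance parameters θ (including any noise-like term ν on the diagonal), and Ψ is expanded using the definitions of p(a_N|θ) and p(t_N|a_N).

```python
def approx_log_likelihood(theta, X, t, kernel):
    """Laplace approximation (6.90) to ln p(t_N | theta).  `kernel` is an
    assumed callable returning the Gram matrix C_N for parameters theta."""
    C = kernel(X, X, theta)
    t = np.asarray(t, dtype=float)
    N = len(t)
    a_star, _ = find_mode(C, t)
    sigma = sigmoid(a_star)
    W = np.diag(sigma * (1.0 - sigma))
    # Psi(a*) = ln p(a*|theta) + ln p(t|a*)
    psi = (-0.5 * a_star @ np.linalg.solve(C, a_star)
           - 0.5 * np.linalg.slogdet(C)[1]
           - 0.5 * N * np.log(2.0 * np.pi)
           + t @ a_star
           - np.sum(np.logaddexp(0.0, a_star)))        # sum_n ln(1 + e^{a_n})
    logdet = np.linalg.slogdet(W + np.linalg.inv(C))[1]
    return psi - 0.5 * logdet + 0.5 * N * np.log(2.0 * np.pi)
```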
Note that changes in θ will cause changes in a_N^⋆, leading to additional terms in the gradient. Thus, when we differentiate (6.90) with respect to θ, we obtain two sets of terms, the first arising from the dependence of the covariance matrix C_N on θ, and the rest arising from the dependence of a_N^⋆ on θ.

The terms arising from the explicit dependence on θ can be found by using (6.80) together with the results (C.21) and (C.22), and are given by

∂ ln p(t_N|θ)/∂θ_j = (1/2) a_N^{⋆T} C_N^{-1} (∂C_N/∂θ_j) C_N^{-1} a_N^⋆ − (1/2) Tr[(I + C_N W_N)^{-1} W_N (∂C_N/∂θ_j)].    (6.91)

To compute the terms arising from the dependence of a_N^⋆ on θ, we note that the Laplace approximation has been constructed such that Ψ(a_N) has zero gradient at a_N = a_N^⋆, and so Ψ(a_N) gives no contribution to the gradient as a result of its dependence on a_N.
This leaves the following contribution to the derivative with respect to a component θ_j of θ

− (1/2) Σ_{n=1}^{N} (∂ ln|W_N + C_N^{-1}| / ∂a_n^⋆) (∂a_n^⋆/∂θ_j)
    = − (1/2) Σ_{n=1}^{N} [(I + C_N W_N)^{-1} C_N]_{nn} σ_n^⋆ (1 − σ_n^⋆)(1 − 2σ_n^⋆) (∂a_n^⋆/∂θ_j)    (6.92)

where σ_n^⋆ = σ(a_n^⋆), and again we have used the result (C.22) together with the definition of W_N. We can evaluate the derivative of a_N^⋆ with respect to θ_j by differentiating the relation (6.84) with respect to θ_j to give

∂a_N^⋆/∂θ_j = (∂C_N/∂θ_j)(t_N − σ_N) − C_N W_N (∂a_N^⋆/∂θ_j).    (6.93)

Rearranging then gives

∂a_N^⋆/∂θ_j = (I + C_N W_N)^{-1} (∂C_N/∂θ_j)(t_N − σ_N).    (6.94)

Combining (6.91), (6.92), and (6.94), we can evaluate the gradient of the log likelihood function, which can be used with standard nonlinear optimization algorithms (Appendix A) in order to determine a value for θ.
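Completing the illustrative sketch, the gradient with respect to a single covariance parameter θ_j can be assembled from (6.91), (6.92) and (6.94). The argument dC_dtheta_j, standing for ∂C_N/∂θ_j, is assumed to be supplied by the kernel implementation; as before, explicit inverses and dense solves are used only for brevity.

```python
def log_likelihood_gradient_j(C, dC_dtheta_j, t, a_star):
    """Derivative of the approximate ln p(t_N | theta) with respect to one
    covariance parameter theta_j, combining (6.91), (6.92) and (6.94)."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    sigma = sigmoid(a_star)
    W = np.diag(sigma * (1.0 - sigma))
    B = np.eye(N) + C @ W                              # I + C_N W_N
    C_inv_a = np.linalg.solve(C, a_star)

    # explicit dependence of C_N on theta_j, equation (6.91)
    explicit = (0.5 * C_inv_a @ dC_dtheta_j @ C_inv_a
                - 0.5 * np.trace(np.linalg.solve(B, W @ dC_dtheta_j)))

    # implicit dependence through the mode a_N*, equations (6.92) and (6.94)
    da_dtheta = np.linalg.solve(B, dC_dtheta_j @ (t - sigma))      # (6.94)
    d_logdet_da = (np.diag(np.linalg.solve(B, C))
                   * sigma * (1.0 - sigma) * (1.0 - 2.0 * sigma))
    implicit = -0.5 * d_logdet_da @ da_dtheta                      # (6.92)

    return explicit + implicit
```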
We can illustrate the application of the Laplace approximation for Gaussian processes using the synthetic two-class data set shown in Figure 6.12. Extension of the Laplace approximation to Gaussian processes involving K > 2 classes, using the softmax activation function, is straightforward (Williams and Barber, 1998).

Figure 6.12 Illustration of the use of a Gaussian process for classification, showing the data on the left together with the optimal decision boundary from the true distribution in green, and the decision boundary from the Gaussian process classifier in black. On the right is the predicted posterior probability for the blue and red classes together with the Gaussian process decision boundary.

6.4.7 Connection to neural networks

We have seen that the range of functions which can be represented by a neural network is governed by the number M of hidden units, and that, for sufficiently large M, a two-layer network can approximate any given function with arbitrary accuracy.
In the framework of maximum likelihood, the number of hidden units needs to be limited (to a level dependent on the size of the training set) in order to avoid over-fitting. However, from a Bayesian perspective it makes little sense to limit the number of parameters in the network according to the size of the training set.

In a Bayesian neural network, the prior distribution over the parameter vector w, in conjunction with the network function f(x, w), produces a prior distribution over functions y(x), where y is the vector of network outputs. Neal (1996) has shown that, for a broad class of prior distributions over w, the distribution of functions generated by a neural network will tend to a Gaussian process in the limit M → ∞.
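Neal's result can be checked empirically. The sketch below is an illustration added here, not taken from the text: it samples many two-layer networks with zero-mean Gaussian priors whose output-weight variance scales as 1/M, and shows that the output at a fixed input settles to a stable, approximately Gaussian distribution as M grows. The tanh activation and the particular prior scales are arbitrary illustrative choices.

```python
def sample_network_outputs(x, M, n_samples=20000, seed=0):
    """Draw outputs y(x) = sum_j v_j tanh(u_j^T x + b_j) for many networks
    whose weights are sampled from zero-mean Gaussian priors, with the
    output-weight variance scaled as 1/M so that var[y] stays finite."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, 1.0, size=(n_samples, M, len(x)))        # input-to-hidden weights
    b = rng.normal(0.0, 1.0, size=(n_samples, M))                # hidden-unit biases
    v = rng.normal(0.0, 1.0 / np.sqrt(M), size=(n_samples, M))   # output weights
    return np.sum(v * np.tanh(u @ x + b), axis=1)

x = np.array([0.5, -1.0])
for M in (1, 10, 1000):
    y = sample_network_outputs(x, M)
    print(M, y.var())    # variance stabilizes; the histogram of y tends to a Gaussian
```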
It should be noted, however, that in this limit the output variables of the neural network become independent. One of the great merits of neural networks is that the outputs share the hidden units and so they can 'borrow statistical strength' from each other, that is, the weights associated with each hidden unit are influenced by all of the output variables, not just by one of them. This property is therefore lost in the Gaussian process limit.

We have seen that a Gaussian process is determined by its covariance (kernel) function. Williams (1998) has given explicit forms for the covariance in the case of two specific choices for the hidden unit activation function (probit and Gaussian). These kernel functions k(x, x′) are nonstationary, i.e. they cannot be expressed as a function of the difference x − x′, as a consequence of the Gaussian weight prior being centred on zero, which breaks translation invariance in weight space.
By working directly with the covariance function we have implicitly marginalized over the distribution of weights. If the weight prior is governed by hyperparameters, then their values will determine the length scales of the distribution over functions, as can be understood by studying the examples in Figure 5.11 for the case of a finite number of hidden units.
Note that we cannot marginalize out the hyperparameters analytically, and must instead resort to techniques of the kind discussed in Section 6.4.

Exercises

6.1 www Consider the dual formulation of the least squares linear regression problem given in Section 6.1. Show that the solution for the components a_n of the vector a can be expressed as a linear combination of the elements of the vector φ(x_n).
Denoting these coefficients by the vector w, show that the dual of the dual formulation is given by the original representation in terms of the parameter vector w.

6.2 In this exercise, we develop a dual formulation of the perceptron learning algorithm. Using the perceptron learning rule (4.55), show that the learned weight vector w can be written as a linear combination of the vectors t_n φ(x_n), where t_n ∈ {−1, +1}.
Denote the coefficients of this linear combination by α_n and derive a formulation of the perceptron learning algorithm, and the predictive function for the perceptron, in terms of the α_n. Show that the feature vector φ(x) enters only in the form of the kernel function k(x, x′) = φ(x)^T φ(x′).

6.3 The nearest-neighbour classifier (Section 2.5.2) assigns a new input vector x to the same class as that of the nearest input vector x_n from the training set, where, in the simplest case, the distance is defined by the Euclidean metric ‖x − x_n‖². By expressing this rule in terms of scalar products and then making use of kernel substitution, formulate the nearest-neighbour classifier for a general nonlinear kernel.

6.4 In Appendix C, we give an example of a matrix that has positive elements but that has a negative eigenvalue and hence that is not positive definite.
Find an example of the converse property, namely a 2 × 2 matrix with positive eigenvalues yet that has at least one negative element.

6.5 www Verify the results (6.13) and (6.14) for constructing valid kernels.

6.6 Verify the results (6.15) and (6.16) for constructing valid kernels.

6.7 www Verify the results (6.17) and (6.18) for constructing valid kernels.

6.8 Verify the results (6.19) and (6.20) for constructing valid kernels.

6.9 Verify the results (6.21) and (6.22) for constructing valid kernels.

6.10 Show that an excellent choice of kernel for learning a function f(x) is given by k(x, x′) = f(x)f(x′), by showing that a linear learning machine based on this kernel will always find a solution proportional to f(x).

6.11 By making use of the expansion (6.25), and then expanding the middle factor as a power series, show that the Gaussian kernel (6.23) can be expressed as the inner product of an infinite-dimensional feature vector.

6.12 www Consider the space of all possible subsets A of a given fixed set D. Show that the kernel function (6.27) corresponds to an inner product in a feature space of dimensionality 2^|D| defined by the mapping φ(A), where A is a subset of D and the element φ_U(A), indexed by the subset U, is given by

φ_U(A) = 1 if U ⊆ A, and 0 otherwise.    (6.95)

Here U ⊆ A denotes that U is either a subset of A or is equal to A.

6.13 Show that the Fisher kernel, defined by (6.33), remains invariant if we make a nonlinear transformation of the parameter vector θ → ψ(θ), where the function ψ(·) is invertible and differentiable.

6.14 www Write down the form of the Fisher kernel, defined by (6.33), for the case of a distribution p(x|µ) = N(x|µ, S) that is Gaussian with mean µ and fixed covariance S.

6.15 By considering the determinant of a 2 × 2 Gram matrix, show that a positive definite kernel function k(x, x′) satisfies the Cauchy-Schwarz inequality

k(x_1, x_2)² ≤ k(x_1, x_1) k(x_2, x_2).    (6.96)

6.16 Consider a parametric model governed by the parameter vector w together with a data set of input values x_1, . . .