However, we have assumed that the contribution to the predictive variance arising from the additive noise, governed by the parameter β, is a constant. For some problems, known as heteroscedastic, the noise variance itself will also depend on x. To model this, we can extend the Gaussian process framework by introducing a second Gaussian process to represent the dependence of β on the input x (Goldberg et al., 1998). Because β is a variance, and hence nonnegative, we use the Gaussian process to model ln β(x).
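As a brief illustration (not from the text), the generative model implied by this construction can be simulated by drawing one function from a Gaussian process for the mean and a second, independent function for ln β(x). The kernel, length-scale, jitter, and random seed below are arbitrary choices made for the sketch.

```python
import numpy as np

def rbf_gram(x, length_scale, variance=1.0):
    """Exponentiated-quadratic Gram matrix on a 1-D grid."""
    sq = (x[:, None] - x[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / length_scale ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)

# Shared covariance for the two independent Gaussian processes; the
# length-scale 0.2 and the 1e-6 jitter are arbitrary choices.
K = rbf_gram(x, length_scale=0.2) + 1e-6 * np.eye(len(x))
L = np.linalg.cholesky(K)

y = L @ rng.normal(size=len(x))          # latent mean function y(x)
log_beta = L @ rng.normal(size=len(x))   # ln beta(x): log of the noise precision

# Heteroscedastic targets: the noise variance 1/beta(x) depends on x.
t = y + rng.normal(size=len(x)) * np.sqrt(np.exp(-log_beta))
print(t[:5])
```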
[Figure 6.9: Samples from the ARD prior for Gaussian processes, in which the kernel function is given by (6.71). The left plot corresponds to η1 = η2 = 1, and the right plot corresponds to η1 = 1, η2 = 0.01.]

6.4.4 Automatic relevance determination

In the previous section, we saw how maximum likelihood could be used to determine a value for the correlation length-scale parameter in a Gaussian process. This technique can usefully be extended by incorporating a separate parameter for each input variable (Rasmussen and Williams, 2006).
The result, as we shall see, is that the optimization of these parameters by maximum likelihood allows the relative importance of different inputs to be inferred from the data. This represents an example in the Gaussian process context of automatic relevance determination, or ARD, which was originally formulated in the framework of neural networks (MacKay, 1994; Neal, 1996). The mechanism by which appropriate inputs are preferred is discussed in Section 7.2.2.

Consider a Gaussian process with a two-dimensional input space x = (x1, x2), having a kernel function of the form

$$ k(\mathbf{x}, \mathbf{x}') = \theta_0 \exp\left\{ -\frac{1}{2} \sum_{i=1}^{2} \eta_i \,(x_i - x_i')^2 \right\} \qquad (6.71) $$

Samples from the resulting prior over functions y(x) are shown for two different settings of the precision parameters ηi in Figure 6.9.
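The following minimal sketch (NumPy code assumed here, not part of the book) shows how prior samples of the kind plotted in Figure 6.9 could be drawn using the kernel (6.71); the grid resolution and random seed are illustrative choices.

```python
import numpy as np

def ard_kernel(X1, X2, theta0=1.0, eta=(1.0, 1.0)):
    """ARD kernel (6.71): theta0 * exp(-0.5 * sum_i eta_i * (x_i - x'_i)^2)."""
    eta = np.asarray(eta)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2 * eta).sum(axis=-1)
    return theta0 * np.exp(-0.5 * sq)

rng = np.random.default_rng(1)

# Grid over the two-dimensional input space; y[i, j] corresponds to (x1=g[i], x2=g[j]).
g = np.linspace(-1.0, 1.0, 25)
X1, X2 = np.meshgrid(g, g, indexing="ij")
X = np.stack([X1.ravel(), X2.ravel()], axis=-1)

for eta in [(1.0, 1.0), (1.0, 0.01)]:        # the two settings shown in Figure 6.9
    K = ard_kernel(X, X, eta=eta) + 1e-6 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    y = (L @ rng.normal(size=len(X))).reshape(len(g), len(g))
    # Average variability of the sampled surface along each input direction:
    # with eta2 = 0.01 the surface is nearly flat in the x2 direction.
    print(eta, "std along x1:", y.std(axis=0).mean(), "std along x2:", y.std(axis=1).mean())
```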
We see that, as a particular parameter ηi becomes small, the function becomes relatively insensitive to the corresponding input variable xi. By adapting these parameters to a data set using maximum likelihood, it becomes possible to detect input variables that have little effect on the predictive distribution, because the corresponding values of ηi will be small. This can be useful in practice because it allows such inputs to be discarded. ARD is illustrated in Figure 6.10 using a simple synthetic data set having three inputs x1, x2 and x3 (Nabney, 2002).
The target variable t is generated by sampling 100 values of x1 from a Gaussian, evaluating the function sin(2πx1), and then adding Gaussian noise. Values of x2 are given by copying the corresponding values of x1 and adding noise, and values of x3 are sampled from an independent Gaussian distribution. Thus x1 is a good predictor of t, x2 is a more noisy predictor of t, and x3 has only chance correlations with t. The marginal likelihood for a Gaussian process with ARD parameters η1, η2, η3 is optimized using the scaled conjugate gradients algorithm. We see from Figure 6.10 that η1 converges to a relatively large value, η2 converges to a much smaller value, and η3 becomes very small, indicating that x3 is irrelevant for predicting t.

[Figure 6.10: Illustration of automatic relevance determination in a Gaussian process for a synthetic problem having three inputs x1, x2, and x3, for which the curves show the corresponding values of the hyperparameters η1 (red), η2 (green), and η3 (blue) as a function of the number of iterations when optimizing the marginal likelihood. Details are given in the text. Note the logarithmic scale on the vertical axis.]
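A hedged sketch of this experiment is given below. It generates the synthetic data as described, but the noise magnitudes are assumptions, and it maximizes the marginal likelihood with L-BFGS from SciPy rather than the scaled conjugate gradients algorithm used in the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data as described in the text: x1 drives the target, x2 is a
# noisy copy of x1, and x3 is independent noise. The noise magnitudes
# (0.3 and 0.1) are assumptions for this sketch.
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 0.3 * rng.normal(size=N)
x3 = rng.normal(size=N)
X = np.column_stack([x1, x2, x3])
t = np.sin(2 * np.pi * x1) + 0.1 * rng.normal(size=N)

def ard_gram(X, eta, theta0=1.0):
    """Gram matrix of the ARD exponential-quadratic kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2 * eta).sum(axis=-1)
    return theta0 * np.exp(-0.5 * sq)

def neg_log_marginal(params):
    """Negative log marginal likelihood of GP regression with covariance
    C = K + (1/beta) I; eta and beta are optimized in log space."""
    eta, beta = np.exp(params[:3]), np.exp(params[3])
    C = ard_gram(X, eta) + (1.0 / beta + 1e-6) * np.eye(N)   # small jitter for stability
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    return 0.5 * t @ alpha + np.log(np.diag(L)).sum() + 0.5 * N * np.log(2 * np.pi)

# L-BFGS (with numerical gradients) stands in for scaled conjugate gradients here.
res = minimize(neg_log_marginal, x0=np.zeros(4), method="L-BFGS-B")
print("learned eta:", np.exp(res.x[:3]))
# Expected qualitative outcome, as in Figure 6.10: eta1 large, eta2 smaller,
# and eta3 driven towards zero, marking x3 as irrelevant.
```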
The ARD framework is easily incorporated into the exponential-quadratic kernel (6.63) to give the following form of kernel function, which has been found useful for applications of Gaussian processes to a range of regression problems

$$ k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0 \exp\left\{ -\frac{1}{2} \sum_{i=1}^{D} \eta_i \,(x_{ni} - x_{mi})^2 \right\} + \theta_2 + \theta_3 \sum_{i=1}^{D} x_{ni}\, x_{mi} \qquad (6.72) $$

where D is the dimensionality of the input space.

6.4.5 Gaussian processes for classification

In a probabilistic approach to classification, our goal is to model the posterior probabilities of the target variable for a new input vector, given a set of training data.
These probabilities must lie in the interval (0, 1), whereas a Gaussian process model makes predictions that lie on the entire real axis. However, we can easily adapt Gaussian processes to classification problems by transforming the output of the Gaussian process using an appropriate nonlinear activation function.

Consider first the two-class problem with a target variable t ∈ {0, 1}. If we define a Gaussian process over a function a(x) and then transform the function using a logistic sigmoid y = σ(a), given by (4.59), then we will obtain a non-Gaussian stochastic process over functions y(x) where y ∈ (0, 1).
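As a small sketch (assumed NumPy code, not from the text), this construction amounts to drawing a(x) from a Gaussian process prior and passing it through the logistic sigmoid; the kernel and its parameters are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 200)

# GP prior over a(x); the length-scale (0.15) and amplitude (10) are
# illustrative choices giving a sample loosely resembling Figure 6.11.
K = 10.0 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.15 ** 2)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
a = L @ rng.normal(size=len(x))

y = sigmoid(a)           # y(x) lies in (0, 1): a valid value for p(t = 1 | x)
print(y.min(), y.max())
```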
This is illustrated for the case of a one-dimensional input space in Figure 6.11, in which the probability distribution over the target variable t is then given by the Bernoulli distribution

$$ p(t|a) = \sigma(a)^{t}\,(1 - \sigma(a))^{1-t}. \qquad (6.73) $$

[Figure 6.11: The left plot shows a sample from a Gaussian process prior over functions a(x), and the right plot shows the result of transforming this sample using a logistic sigmoid function.]

As usual, we denote the training set inputs by x1, . . . , xN with corresponding observed target variables t = (t1, . . . , tN)^T.
We also consider a single test point xN+1 with target value tN+1. Our goal is to determine the predictive distribution p(tN+1|t), where we have left the conditioning on the input variables implicit. To do this we introduce a Gaussian process prior over the vector aN+1, which has components a(x1), . . . , a(xN+1). This in turn defines a non-Gaussian process over tN+1, and by conditioning on the training data tN we obtain the required predictive distribution. The Gaussian process prior for aN+1 takes the form

$$ p(\mathbf{a}_{N+1}) = \mathcal{N}(\mathbf{a}_{N+1}|\mathbf{0}, \mathbf{C}_{N+1}). \qquad (6.74) $$

Unlike the regression case, the covariance matrix no longer includes a noise term because we assume that all of the training data points are correctly labelled.
However, for numerical reasons it is convenient to introduce a noise-like term governed by a parameter ν that ensures that the covariance matrix is positive definite. Thus the covariance matrix CN+1 has elements given by

$$ C(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) + \nu\,\delta_{nm} \qquad (6.75) $$

where k(xn, xm) is any positive semidefinite kernel function of the kind considered in Section 6.2, and the value of ν is typically fixed in advance. We shall assume that the kernel function k(x, x′) is governed by a vector θ of parameters, and we shall later discuss how θ may be learned from the training data.
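For concreteness, here is a minimal sketch (not from the text) of building the covariance matrix of (6.75); the exponential-quadratic kernel and the value ν = 1e-6 are assumptions made for illustration.

```python
import numpy as np

def covariance_matrix(X, kernel, nu=1e-6):
    """Covariance matrix of (6.75): C_nm = k(x_n, x_m) + nu * delta_nm."""
    K = kernel(X[:, None, :], X[None, :, :])
    return K + nu * np.eye(len(X))

# An exponential-quadratic kernel is used here purely as an example.
kernel = lambda a, b: np.exp(-0.5 * ((a - b) ** 2).sum(axis=-1))

X = np.random.default_rng(3).normal(size=(50, 2))
C = covariance_matrix(X, kernel)
np.linalg.cholesky(C)    # succeeds: the nu * I term keeps C positive definite
```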
For two-class problems, it is sufficient to predict p(tN+1 = 1|tN) because the value of p(tN+1 = 0|tN) is then given by 1 − p(tN+1 = 1|tN). The required predictive distribution is given by

$$ p(t_{N+1} = 1|\mathbf{t}_N) = \int p(t_{N+1} = 1|a_{N+1})\,p(a_{N+1}|\mathbf{t}_N)\,\mathrm{d}a_{N+1} \qquad (6.76) $$

where p(tN+1 = 1|aN+1) = σ(aN+1).

This integral is analytically intractable, and so may be approximated using sampling methods (Neal, 1997). Alternatively, we can consider techniques based on an analytical approximation. In Section 4.5.2, we derived the approximate formula (4.153) for the convolution of a logistic sigmoid with a Gaussian distribution.
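A short numerical sketch of that approximation, σ(κ(σ²)μ) with κ(σ²) = (1 + πσ²/8)^(−1/2), checked against a Monte Carlo estimate; the test values of μ and σ² are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_gauss_conv(mu, var):
    """Approximation (4.153): int sigma(a) N(a | mu, var) da ~= sigma(kappa(var) * mu),
    with kappa(var) = (1 + pi * var / 8) ** (-1/2)."""
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return sigmoid(kappa * mu)

# Quick Monte Carlo check of the approximation.
rng = np.random.default_rng(4)
mu, var = 1.2, 2.0                       # illustrative posterior moments
samples = rng.normal(mu, np.sqrt(var), size=200_000)
print(sigmoid_gauss_conv(mu, var))       # analytic approximation
print(sigmoid(samples).mean())           # Monte Carlo estimate; the two agree closely
```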
We can use this result to evaluate the integral in (6.76) provided we have a Gaussian approximation to the posterior distribution p(aN+1|tN). The usual justification for a Gaussian approximation to a posterior distribution is that the true posterior will tend to a Gaussian as the number of data points increases as a consequence of the central limit theorem. In the case of Gaussian processes, the number of variables grows with the number of data points, and so this argument does not apply directly. However, if we consider increasing the number of data points falling in a fixed region of x space, then the corresponding uncertainty in the function a(x) will decrease, again leading asymptotically to a Gaussian (Williams and Barber, 1998).

Three different approaches to obtaining a Gaussian approximation have been considered.
One technique is based on variational inference (Gibbs and MacKay, 2000) and makes use of the local variational bound (10.144) on the logistic sigmoid. This allows the product of sigmoid functions to be approximated by a product of Gaussians, thereby allowing the marginalization over aN to be performed analytically. The approach also yields a lower bound on the likelihood function p(tN|θ). The variational framework for Gaussian process classification can also be extended to multiclass (K > 2) problems by using a Gaussian approximation to the softmax function (Gibbs, 1997).

A second approach uses expectation propagation (Opper and Winther, 2000b; Minka, 2001b; Seeger, 2003).
Because the true posterior distribution is unimodal, as we shall see shortly, the expectation propagation approach can give good results.

6.4.6 Laplace approximation

The third approach to Gaussian process classification is based on the Laplace approximation, which we now consider in detail. In order to evaluate the predictive distribution (6.76), we seek a Gaussian approximation to the posterior distribution over aN+1, which, using Bayes' theorem, is given by

$$ \begin{aligned} p(a_{N+1}|\mathbf{t}_N) &= \int p(a_{N+1}, \mathbf{a}_N|\mathbf{t}_N)\,\mathrm{d}\mathbf{a}_N \\ &= \frac{1}{p(\mathbf{t}_N)}\int p(a_{N+1}, \mathbf{a}_N)\,p(\mathbf{t}_N|a_{N+1}, \mathbf{a}_N)\,\mathrm{d}\mathbf{a}_N \\ &= \frac{1}{p(\mathbf{t}_N)}\int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N)\,p(\mathbf{t}_N|\mathbf{a}_N)\,\mathrm{d}\mathbf{a}_N \\ &= \int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N|\mathbf{t}_N)\,\mathrm{d}\mathbf{a}_N \end{aligned} \qquad (6.77) $$
where we have used p(tN|aN+1, aN) = p(tN|aN). The conditional distribution p(aN+1|aN) is obtained by invoking the results (6.66) and (6.67) for Gaussian process regression, to give

$$ p(a_{N+1}|\mathbf{a}_N) = \mathcal{N}\!\left(a_{N+1}\,\middle|\,\mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{a}_N,\; c - \mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{k}\right). \qquad (6.78) $$

We can therefore evaluate the integral in (6.77) by finding a Laplace approximation for the posterior distribution p(aN|tN), and then using the standard result for the convolution of two Gaussian distributions.

The prior p(aN) is given by a zero-mean Gaussian process with covariance matrix CN, and the data term (assuming independence of the data points) is given by

$$ p(\mathbf{t}_N|\mathbf{a}_N) = \prod_{n=1}^{N} \sigma(a_n)^{t_n}\,(1 - \sigma(a_n))^{1-t_n} = \prod_{n=1}^{N} e^{a_n t_n}\,\sigma(-a_n). \qquad (6.79) $$

We then obtain the Laplace approximation by Taylor expanding the logarithm of p(aN|tN), which up to an additive normalization constant is given by the quantity

$$ \begin{aligned} \Psi(\mathbf{a}_N) &= \ln p(\mathbf{a}_N) + \ln p(\mathbf{t}_N|\mathbf{a}_N) \\ &= -\frac{1}{2}\mathbf{a}_N^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{a}_N - \frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{C}_N| + \mathbf{t}_N^{\mathrm{T}}\mathbf{a}_N - \sum_{n=1}^{N}\ln(1 + e^{a_n}) + \text{const}. \end{aligned} \qquad (6.80) $$

First we need to find the mode of the posterior distribution, and this requires that we evaluate the gradient of Ψ(aN), which is given by

$$ \nabla\Psi(\mathbf{a}_N) = \mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N $$

where σN is a vector with elements σ(an).
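To make the mode-finding step concrete, the following sketch (assumed NumPy code, not from the text) runs a Newton iteration on Ψ(aN) using the gradient above. The Hessian it uses, −(W + CN^(−1)) with W = diag(σn(1 − σn)), is not derived in this excerpt but follows by differentiating the gradient once more; the toy kernel and labels are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(C, t, num_iters=20):
    """Newton iteration for the mode of Psi(a_N) = ln p(a_N) + ln p(t_N | a_N).

    Gradient: t_N - sigma_N - C_N^{-1} a_N   (as given in the text).
    Hessian:  -(W + C_N^{-1}),  W = diag(sigma_n (1 - sigma_n)),
    obtained by differentiating the gradient once more (not shown in this excerpt).
    """
    N = len(t)
    a = np.zeros(N)
    C_inv = np.linalg.inv(C)                  # acceptable for a small toy example
    for _ in range(num_iters):
        sigma = sigmoid(a)
        grad = t - sigma - C_inv @ a
        hess = -(np.diag(sigma * (1.0 - sigma)) + C_inv)
        a = a - np.linalg.solve(hess, grad)   # Newton step towards the mode
    return a

# Toy one-dimensional classification problem with an assumed kernel.
rng = np.random.default_rng(5)
x = np.sort(rng.normal(size=30))
t = (x > 0).astype(float)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
C = K + 1e-4 * np.eye(len(t))                 # noise-like term nu, as in (6.75)

a_mode = laplace_mode(C, t)
print(np.round(sigmoid(a_mode), 2))           # fitted class probabilities at the inputs
```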