Samples from this prior are plotted for various settings of the parameters θ0, . . . , θ3 in Figure 6.5, and Figure 6.6 shows a set of points sampled from the joint distribution (6.60) along with the corresponding values defined by (6.61).

[Figure 6.5: Samples from a Gaussian process prior defined by the covariance function (6.63). The title above each plot denotes (θ0, θ1, θ2, θ3); the six panels correspond to the settings (1.00, 4.00, 0.00, 0.00), (9.00, 4.00, 0.00, 0.00), (1.00, 64.00, 0.00, 0.00), (1.00, 0.25, 0.00, 0.00), (1.00, 4.00, 10.00, 0.00), and (1.00, 4.00, 0.00, 5.00).]
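As a concrete illustration of this sampling process, the following NumPy sketch draws one function from a Gaussian process prior and then adds observation noise, in the spirit of Figures 6.5 and 6.6. It is a minimal sketch, assuming that (6.63) has the form θ0 exp(-θ1‖x_n − x_m‖²/2) + θ2 + θ3 x_n^T x_m; the parameter values and all variable names are illustrative choices rather than code from the text.

```python
import numpy as np

def kernel(xn, xm, theta0=1.0, theta1=4.0, theta2=0.0, theta3=0.0):
    # Assumed form for the covariance function (6.63): an exponential-of-quadratic
    # term plus constant and linear terms, governed by (theta0, theta1, theta2, theta3).
    return (theta0 * np.exp(-0.5 * theta1 * (xn - xm) ** 2)
            + theta2 + theta3 * xn * xm)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 101)                  # input locations x_n
K = kernel(x[:, None], x[None, :])               # Gram matrix with elements k(x_n, x_m)
K += 1e-10 * np.eye(len(x))                      # small jitter for numerical stability

# Draw one sample function y from the prior over functions (the blue curve of
# Figure 6.6), then add independent Gaussian noise of precision beta to obtain
# the observed targets t_n (the green points of Figure 6.6).
y = rng.multivariate_normal(np.zeros(len(x)), K)
beta = 25.0                                      # illustrative noise precision
t = y + rng.normal(scale=beta ** -0.5, size=len(x))
```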
So far, we have used the Gaussian process viewpoint to build a model of the joint distribution over sets of data points. Our goal in regression, however, is to make predictions of the target variables for new inputs, given a set of training data. Let us suppose that t_N = (t_1, . . . , t_N)^T, corresponding to input values x_1, . . . , x_N, comprise the observed training set, and our goal is to predict the target variable t_{N+1} for a new input vector x_{N+1}. This requires that we evaluate the predictive distribution p(t_{N+1}|t_N). Note that this distribution is conditioned also on the variables x_1, . . . , x_N and x_{N+1}. However, to keep the notation simple we will not show these conditioning variables explicitly.

To find the conditional distribution p(t_{N+1}|t), we begin by writing down the joint distribution p(t_{N+1}), where t_{N+1} denotes the vector (t_1, . . . , t_N, t_{N+1})^T. We then apply the results from Section 2.3.1 to obtain the required conditional distribution, as illustrated in Figure 6.7.
From (6.61), the joint distribution over t_1, . . . , t_{N+1} will be given by

p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1})        (6.64)

where C_{N+1} is an (N + 1) × (N + 1) covariance matrix with elements given by (6.62). Because this joint distribution is Gaussian, we can apply the results from Section 2.3.1 to find the conditional Gaussian distribution. To do this, we partition the covariance matrix as follows

C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}        (6.65)

where C_N is the N × N covariance matrix with elements given by (6.62) for n, m = 1, . . . , N, the vector k has elements k(x_n, x_{N+1}) for n = 1, . . . , N, and the scalar c = k(x_{N+1}, x_{N+1}) + β^{-1}. Using the results (2.81) and (2.82), we see that the conditional distribution p(t_{N+1}|t) is a Gaussian distribution with mean and covariance given by

m(x_{N+1}) = k^T C_N^{-1} t        (6.66)

σ^2(x_{N+1}) = c - k^T C_N^{-1} k.        (6.67)

These are the key results that define Gaussian process regression. Because the vector k is a function of the test point input value x_{N+1}, we see that the predictive distribution is a Gaussian whose mean and variance both depend on x_{N+1}. An example of Gaussian process regression is shown in Figure 6.8.
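Read as an algorithm, (6.66) and (6.67) amount to building C_N from the training inputs, forming the vector k and the scalar c for each test input, and performing a couple of linear solves. The NumPy sketch below is a minimal illustration of these two equations, not code from the text; it assumes one-dimensional inputs, a kernel function such as the hypothetical one above, and solves against C_N rather than forming C_N^{-1} explicitly.

```python
import numpy as np

def gp_predict(x_train, t_train, x_test, kernel, beta):
    """Predictive mean (6.66) and variance (6.67) of Gaussian process regression."""
    N = len(x_train)
    # C_N has elements k(x_n, x_m) + beta^{-1} * delta_{nm}, cf. (6.62).
    C_N = kernel(x_train[:, None], x_train[None, :]) + np.eye(N) / beta
    C_inv_t = np.linalg.solve(C_N, t_train)              # C_N^{-1} t, computed once
    means, variances = [], []
    for x_new in x_test:
        k = kernel(x_train, x_new)                       # k has elements k(x_n, x_{N+1})
        c = kernel(x_new, x_new) + 1.0 / beta            # c = k(x_{N+1}, x_{N+1}) + beta^{-1}
        means.append(k @ C_inv_t)                        # m(x_{N+1}) = k^T C_N^{-1} t
        variances.append(c - k @ np.linalg.solve(C_N, k))  # c - k^T C_N^{-1} k
    return np.array(means), np.array(variances)
```

Evaluated on a grid of test inputs, the returned variance grows as the test point moves away from the training data, which is the qualitative behaviour illustrated in Figure 6.8.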
The only restriction on the kernel function is that the covariance matrix given by (6.62) must be positive definite. If λ_i is an eigenvalue of K, then the corresponding eigenvalue of C will be λ_i + β^{-1}. It is therefore sufficient that the kernel matrix k(x_n, x_m) be positive semidefinite for any pair of points x_n and x_m, so that λ_i ≥ 0, because any eigenvalue λ_i that is zero will still give rise to a positive eigenvalue for C because β > 0. This is the same restriction on the kernel function discussed earlier, and so we can again exploit all of the techniques in Section 6.2 to construct suitable kernels.
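This eigenvalue argument is easy to check numerically. The toy sketch below is purely illustrative (the rank-deficient Gram matrix and the value of β are arbitrary choices): it confirms that adding β^{-1}I shifts every eigenvalue of K by β^{-1}, so a kernel matrix that is only positive semidefinite still yields a positive definite C.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
K = A @ A.T                      # rank-3 PSD Gram matrix: two eigenvalues are (numerically) zero
beta = 10.0
C = K + np.eye(5) / beta         # C = K + beta^{-1} I

lam_K = np.linalg.eigvalsh(K)    # eigenvalues lambda_i of K, in ascending order
lam_C = np.linalg.eigvalsh(C)    # eigenvalues of C

print(np.allclose(lam_C, lam_K + 1.0 / beta))   # True: every eigenvalue shifted by beta^{-1}
print(bool((lam_C > 0).all()))                  # True: C is positive definite
```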
[Figure 6.6: Illustration of the sampling of data points {t_n} from a Gaussian process. The blue curve shows a sample function from the Gaussian process prior over functions, and the red points show the values of y_n obtained by evaluating the function at a set of input values {x_n}. The corresponding values of {t_n}, shown in green, are obtained by adding independent Gaussian noise to each of the {y_n}.]

Note that the mean (6.66) of the predictive distribution can be written, as a function of x_{N+1}, in the form

m(x_{N+1}) = \sum_{n=1}^{N} a_n k(x_n, x_{N+1})        (6.68)

where a_n is the nth component of C_N^{-1} t. Thus, if the kernel function k(x_n, x_m) depends only on the distance ‖x_n − x_m‖, then we obtain an expansion in radial basis functions.
The results (6.66) and (6.67) define the predictive distribution for Gaussian process regression with an arbitrary kernel function k(x_n, x_m). In the particular case in which the kernel function k(x, x′) is defined in terms of a finite set of basis functions, we can derive the results obtained previously in Section 3.3.2 for linear regression starting from the Gaussian process viewpoint.

For such models, we can therefore obtain the predictive distribution either by taking a parameter space viewpoint and using the linear regression result or by taking a function space viewpoint and using the Gaussian process result.

[Figure 6.7: Illustration of the mechanism of Gaussian process regression for the case of one training point and one test point, in which the red ellipses show contours of the joint distribution p(t_1, t_2). Here t_1 is the training data point, and conditioning on the value of t_1, corresponding to the vertical blue line, we obtain p(t_2|t_1) shown as a function of t_2 by the green curve.]

The central computational operation in using Gaussian processes will involve the inversion of a matrix of size N × N, for which standard methods require O(N^3) computations. By contrast, in the basis function model we have to invert a matrix S_N of size M × M, which has O(M^3) computational complexity. Note that for both viewpoints, the matrix inversion must be performed once for the given training set. For each new test point, both methods require a vector-matrix multiply, which has cost O(N^2) in the Gaussian process case and O(M^2) for the linear basis function model. If the number M of basis functions is smaller than the number N of data points, it will be computationally more efficient to work in the basis function framework. However, an advantage of the Gaussian process viewpoint is that we can consider covariance functions that can only be expressed in terms of an infinite number of basis functions.
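The separation between the one-off O(N^3) cost and the O(N^2) cost per test point is easy to see in code. The sketch below factorizes C_N once with a Cholesky decomposition and reuses the factor for every prediction; the use of a Cholesky factorization, the toy data set, and all names here are illustrative choices, not something prescribed by the text.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Toy one-dimensional training set and a simple squared-exponential kernel,
# used only to show how the costs separate.
rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, size=200)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=200)
beta = 100.0

def kern(a, b):
    return np.exp(-8.0 * (a - b) ** 2)

C_N = kern(x_train[:, None], x_train[None, :]) + np.eye(len(x_train)) / beta

# One-off O(N^3) work for the given training set: factorize C_N once.
chol = cho_factor(C_N)
C_inv_t = cho_solve(chol, t_train)          # C_N^{-1} t, reused for every test point

def predict_one(x_new):
    # O(N^2) work per test point: build k, one cho_solve call, a few dot products.
    k = kern(x_train, x_new)
    c = kern(x_new, x_new) + 1.0 / beta
    return k @ C_inv_t, c - k @ cho_solve(chol, k)   # mean (6.66), variance (6.67)

print(predict_one(0.3))
```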
For large training data sets, however, the direct application of Gaussian process methods can become infeasible, and so a range of approximation schemes have been developed that have better scaling with training set size than the exact approach (Gibbs, 1997; Tresp, 2001; Smola and Bartlett, 2001; Williams and Seeger, 2001; Csató and Opper, 2002; Seeger et al., 2003). Practical issues in the application of Gaussian processes are discussed in Bishop and Nabney (2008).

We have introduced Gaussian process regression for the case of a single target variable. The extension of this formalism to multiple target variables, known as co-kriging (Cressie, 1993), is straightforward. Various other extensions of Gaussian process regression have also been considered, for purposes such as modelling the distribution over low-dimensional manifolds for unsupervised learning (Bishop et al., 1998a) and the solution of stochastic differential equations (Graepel, 2003).

[Figure 6.8: Illustration of Gaussian process regression applied to the sinusoidal data set in Figure A.6 in which the three right-most data points have been omitted. The green curve shows the sinusoidal function from which the data points, shown in blue, are obtained by sampling and addition of Gaussian noise. The red line shows the mean of the Gaussian process predictive distribution, and the shaded region corresponds to plus and minus two standard deviations. Notice how the uncertainty increases in the region to the right of the data points.]

6.4.3 Learning the hyperparameters
The predictions of a Gaussian process model will depend, in part, on the choice of covariance function. In practice, rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data. These parameters govern such things as the length scale of the correlations and the precision of the noise, and correspond to the hyperparameters in a standard parametric model.

Techniques for learning the hyperparameters are based on the evaluation of the likelihood function p(t|θ), where θ denotes the hyperparameters of the Gaussian process model. The simplest approach is to make a point estimate of θ by maximizing the log likelihood function. Because θ represents a set of hyperparameters for the regression problem, this can be viewed as analogous to the type 2 maximum likelihood procedure (Section 3.5) for linear regression models. Maximization of the log likelihood can be done using efficient gradient-based optimization algorithms such as conjugate gradients (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008).
The log likelihood function for a Gaussian process regression model is easily evaluated using the standard form for a multivariate Gaussian distribution, giving

\ln p(t|θ) = -\frac{1}{2} \ln |C_N| - \frac{1}{2} t^T C_N^{-1} t - \frac{N}{2} \ln(2π).        (6.69)
For nonlinear optimization, we also need the gradient of the log likelihood function with respect to the parameter vector θ. We shall assume that evaluation of the derivatives of C_N is straightforward, as would be the case for the covariance functions considered in this chapter. Making use of the result (C.21) for the derivative of C_N^{-1}, together with the result (C.22) for the derivative of \ln |C_N|, we obtain

\frac{\partial}{\partial θ_i} \ln p(t|θ) = -\frac{1}{2} \mathrm{Tr}\left( C_N^{-1} \frac{\partial C_N}{\partial θ_i} \right) + \frac{1}{2} t^T C_N^{-1} \frac{\partial C_N}{\partial θ_i} C_N^{-1} t.        (6.70)

Because \ln p(t|θ) will in general be a nonconvex function, it can have multiple maxima.
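As a worked illustration of (6.69) and (6.70), the sketch below evaluates the log likelihood and its gradient for a hypothetical two-parameter covariance function k(x_n, x_m) = θ0 exp(-θ1 (x_n − x_m)^2 / 2) plus the β^{-1} noise term. The kernel choice, its derivatives, and the function names are assumptions made for this example only; the resulting pair of values could be handed, with signs flipped, to any gradient-based optimizer such as a conjugate-gradient routine.

```python
import numpy as np

def log_likelihood_and_grad(theta, x, t, beta):
    """ln p(t|theta), eq. (6.69), and its gradient, eq. (6.70), for a hypothetical
    kernel k(xn, xm) = theta0 * exp(-0.5 * theta1 * (xn - xm)**2)."""
    theta0, theta1 = theta
    N = len(x)
    sq_dist = (x[:, None] - x[None, :]) ** 2
    expo = np.exp(-0.5 * theta1 * sq_dist)
    C_N = theta0 * expo + np.eye(N) / beta           # covariance matrix C_N, cf. (6.62)

    C_inv = np.linalg.inv(C_N)                       # explicit inverse, for clarity only
    C_inv_t = C_inv @ t
    _, logdet = np.linalg.slogdet(C_N)
    ll = -0.5 * logdet - 0.5 * t @ C_inv_t - 0.5 * N * np.log(2 * np.pi)   # (6.69)

    # Derivatives of C_N with respect to theta0 and theta1.
    dC = [expo, -0.5 * theta0 * sq_dist * expo]
    grad = np.array([-0.5 * np.trace(C_inv @ dCi) + 0.5 * C_inv_t @ dCi @ C_inv_t
                     for dCi in dC])                 # the two terms of (6.70)
    return ll, grad
```

In practice one would typically work with a Cholesky factorization of C_N rather than its explicit inverse, and, as noted above, the objective is in general nonconvex, so different initializations of θ may lead to different maxima.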
It is straightforward to introduce a prior over θ and to maximize the log posterior using gradient-based methods. In a fully Bayesian treatment, we need to evaluate marginals over θ weighted by the product of the prior p(θ) and the likelihood function p(t|θ). In general, however, exact marginalization will be intractable, and we must resort to approximations.

The Gaussian process regression model gives a predictive distribution whose mean and variance are functions of the input vector x.