This is the covariance matrix of the Fisher scores, and so the Fisher kernel corresponds to a whitening of these scores (Section 12.1.3). More simply, we can just omit the Fisher information matrix altogether and use the noninvariant kernel

\[ k(x, x') = g(\theta, x)^{\mathrm{T}}\, g(\theta, x'). \qquad (6.36) \]

An application of Fisher kernels to document retrieval is given by Hofmann (2000).

A final example of a kernel function is the sigmoidal kernel given by

\[ k(x, x') = \tanh\!\left( a\, x^{\mathrm{T}} x' + b \right) \qquad (6.37) \]

whose Gram matrix in general is not positive semidefinite. This form of kernel has, however, been used in practice (Vapnik, 1995), possibly because it gives kernel expansions such as the support vector machine a superficial resemblance to neural network models. As we shall see, in the limit of an infinite number of basis functions, a Bayesian neural network with an appropriate prior reduces to a Gaussian process (Section 6.4.7), thereby providing a deeper link between neural networks and kernel methods.
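That the Gram matrix of (6.37) can fail to be positive semidefinite is easy to check numerically. The following is a minimal sketch, assuming NumPy; the two data points and the parameter values a = 1, b = 0 are illustrative choices, not taken from the text.

```python
import numpy as np

def sigmoid_kernel_gram(X, a=1.0, b=0.0):
    """Gram matrix of the sigmoidal kernel k(x, x') = tanh(a x^T x' + b)."""
    return np.tanh(a * X @ X.T + b)

# Two one-dimensional inputs; an illustrative choice of parameters.
X = np.array([[1.0], [2.0]])
K = sigmoid_kernel_gram(X, a=1.0, b=0.0)

# For this choice the smallest eigenvalue is negative,
# so K is not positive semidefinite.
print(np.linalg.eigvalsh(K))
```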
6.3. Radial Basis Function Networks

In Chapter 3, we discussed regression models based on linear combinations of fixed basis functions, although we did not discuss in detail what form those basis functions might take. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a centre µ_j, so that φ_j(x) = h(‖x − µ_j‖).

Historically, radial basis functions were introduced for the purpose of exact function interpolation (Powell, 1987). Given a set of input vectors {x_1, . . . , x_N} along with corresponding target values {t_1, . . . , t_N}, the goal is to find a smooth function f(x) that fits every target value exactly, so that f(x_n) = t_n for n = 1, . . . , N. This is achieved by expressing f(x) as a linear combination of radial basis functions, one centred on every data point

\[ f(x) = \sum_{n=1}^{N} w_n\, h(\|x - x_n\|). \qquad (6.38) \]

The values of the coefficients {w_n} are found by least squares, and because there are the same number of coefficients as there are constraints, the result is a function that fits every target value exactly. In pattern recognition applications, however, the target values are generally noisy, and exact interpolation is undesirable because this corresponds to an over-fitted solution.
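To make the construction (6.38) concrete, here is a minimal sketch, assuming NumPy and a Gaussian choice for h; the width parameter and the toy data are illustrative assumptions. Because there are as many coefficients as constraints, the least-squares fit reduces to solving the square N × N linear system, and the resulting function passes through every target value.

```python
import numpy as np

def gaussian_rbf(r, width=0.1):
    """Radial basis function h(r); a Gaussian profile is one common choice."""
    return np.exp(-0.5 * (r / width) ** 2)

def fit_exact_interpolant(X, t, width=0.1):
    """Solve the N x N system H w = t so that f(x_n) = t_n exactly, cf. (6.38)."""
    H = gaussian_rbf(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1), width)
    return np.linalg.solve(H, t)

def predict(X_new, X, w, width=0.1):
    """Evaluate f(x) = sum_n w_n h(||x - x_n||) at new inputs."""
    H = gaussian_rbf(np.linalg.norm(X_new[:, None, :] - X[None, :, :], axis=-1), width)
    return H @ w

# Toy one-dimensional example (illustrative data only).
X = np.linspace(0.0, 1.0, 10)[:, None]
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.default_rng(0).standard_normal(10)
w = fit_exact_interpolant(X, t)
print(np.allclose(predict(X, X, w), t))   # True: the fit passes through every target
```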
Expansions in radial basis functions also arise from regularization theory (Poggio and Girosi, 1990; Bishop, 1995a). For a sum-of-squares error function with a regularizer defined in terms of a differential operator, the optimal solution is given by an expansion in the Green's functions of the operator (which are analogous to the eigenvectors of a discrete matrix), again with one basis function centred on each data point. If the differential operator is isotropic then the Green's functions depend only on the radial distance from the corresponding data point. Due to the presence of the regularizer, the solution no longer interpolates the training data exactly.

Another motivation for radial basis functions comes from a consideration of the interpolation problem when the input (rather than the target) variables are noisy (Webb, 1994; Bishop, 1995a). If the noise on the input variable x is described by a variable ξ having a distribution ν(ξ), then the sum-of-squares error function becomes

\[ E = \frac{1}{2} \sum_{n=1}^{N} \int \left\{ y(x_n + \xi) - t_n \right\}^2 \nu(\xi)\, \mathrm{d}\xi. \qquad (6.39) \]

Using the calculus of variations (Appendix D), we can optimize with respect to the function y(x) to give (Exercise 6.17)

\[ y(x) = \sum_{n=1}^{N} t_n\, h(x - x_n) \qquad (6.40) \]

where the basis functions are given by

\[ h(x - x_n) = \frac{\nu(x - x_n)}{\displaystyle\sum_{n=1}^{N} \nu(x - x_n)}. \qquad (6.41) \]

We see that there is one basis function centred on every data point.
This is known as the Nadaraya-Watson model and will be derived again from a different perspective in Section 6.3.1. If the noise distribution ν(ξ) is isotropic, so that it is a function only of ‖ξ‖, then the basis functions will be radial.

Note that the basis functions (6.41) are normalized, so that Σ_n h(x − x_n) = 1 for any value of x. The effect of such normalization is shown in Figure 6.2. Normalization is sometimes used in practice as it avoids having regions of input space where all of the basis functions take small values, which would necessarily lead to predictions in such regions that are either small or controlled purely by the bias parameter. Another situation in which expansions in normalized radial basis functions arise is in the application of kernel density estimation to the problem of regression, as we shall discuss in Section 6.3.1.
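The normalization effect illustrated in Figure 6.2 can be reproduced in a few lines. The sketch below is a minimal illustration, assuming NumPy; the Gaussian centres and width are arbitrary choices made for the example.

```python
import numpy as np

def gaussian_basis(x, centres, width=0.2):
    """Unnormalized Gaussian basis functions evaluated at the points x."""
    return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)

def normalized_basis(x, centres, width=0.2):
    """Normalized basis functions, which sum to one at every x, cf. (6.41)."""
    phi = gaussian_basis(x, centres, width)
    return phi / phi.sum(axis=1, keepdims=True)

# Arbitrary centres and evaluation grid, for illustration only.
centres = np.array([-0.5, 0.0, 0.3, 0.6])
x = np.linspace(-1.0, 1.0, 201)

phi_norm = normalized_basis(x, centres)
print(np.allclose(phi_norm.sum(axis=1), 1.0))   # True: they sum to one everywhere
```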
Because there is one basis function associated with every data point, the corresponding model can be computationally costly to evaluate when making predictions for new data points. Models have therefore been proposed (Broomhead and Lowe, 1988; Moody and Darken, 1989; Poggio and Girosi, 1990), which retain the expansion in radial basis functions but where the number M of basis functions is smaller than the number N of data points. Typically, the number of basis functions, and the locations µ_i of their centres, are determined based on the input data {x_n} alone. The basis functions are then kept fixed and the coefficients {w_i} are determined by least squares by solving the usual set of linear equations, as discussed in Section 3.1.1.
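A minimal sketch of such a reduced model, assuming NumPy: the M centres are taken as a random subset of the training inputs (one of the simple strategies discussed next), the Gaussian basis and its width are arbitrary choices, and the weights, including a bias term, are then fitted by linear least squares as in Section 3.1.1.

```python
import numpy as np

def design_matrix(X, centres, width):
    """Design matrix of Gaussian basis functions plus a constant bias column."""
    r = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
    Phi = np.exp(-0.5 * (r / width) ** 2)
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def fit_rbf_network(X, t, M=6, width=0.15, seed=0):
    """Pick M centres at random from the data, then fit the weights by least squares."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(X.shape[0], size=M, replace=False)]
    Phi = design_matrix(X, centres, width)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return centres, w

def predict(X_new, centres, w, width=0.15):
    return design_matrix(X_new, centres, width) @ w

# Illustrative one-dimensional data.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(50, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)
centres, w = fit_rbf_network(X, t)
print(predict(X[:3], centres, w))
```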
Figure 6.2  Plot of a set of Gaussian basis functions on the left, together with the corresponding normalized basis functions on the right.

One of the simplest ways of choosing basis function centres is to use a randomly chosen subset of the data points. A more systematic approach is called orthogonal least squares (Chen et al., 1991).
This is a sequential selection process in which at each step the next data point to be chosen as a basis function centre corresponds to the one that gives the greatest reduction in the sum-of-squares error. Values for the expansion coefficients are determined as part of the algorithm. Clustering algorithms such as K-means (Section 9.1) have also been used, which give a set of basis function centres that no longer coincide with training data points.
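The greedy selection idea can be sketched in a few lines. The following is a naive forward-selection illustration in that spirit, assuming NumPy; it simply re-fits the least-squares weights for each candidate centre rather than using the efficient orthogonalization of Chen et al. (1991), and the Gaussian basis and width are arbitrary choices.

```python
import numpy as np

def rbf_design(X, centres, width=0.2):
    """Gaussian basis functions evaluated at each row of X for the given centres."""
    r = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
    return np.exp(-0.5 * (r / width) ** 2)

def sse_with_centres(X, t, centres, width=0.2):
    """Residual sum of squares after a least-squares fit with the given centres."""
    Phi = rbf_design(X, centres, width)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return np.sum((Phi @ w - t) ** 2)

def greedy_select_centres(X, t, M=5, width=0.2):
    """At each step, add the data point whose inclusion most reduces the error."""
    chosen, remaining = [], list(range(X.shape[0]))
    for _ in range(M):
        errors = [sse_with_centres(X, t, X[chosen + [i]], width) for i in remaining]
        best = remaining[int(np.argmin(errors))]
        chosen.append(best)
        remaining.remove(best)
    return X[chosen]
```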
6.3.1 Nadaraya-Watson model

In Section 3.3.3, we saw that the prediction of a linear regression model for a new input x takes the form of a linear combination of the training set target values with coefficients given by the 'equivalent kernel' (3.62), where the equivalent kernel satisfies the summation constraint (3.64).

We can motivate the kernel regression model (3.61) from a different perspective, starting with kernel density estimation. Suppose we have a training set {x_n, t_n} and we use a Parzen density estimator (Section 2.5.1) to model the joint distribution p(x, t), so that

\[ p(x, t) = \frac{1}{N} \sum_{n=1}^{N} f(x - x_n, t - t_n) \qquad (6.42) \]

where f(x, t) is the component density function, and there is one such component centred on each data point.
We now find an expression for the regression function y(x), corresponding to the conditional average of the target variable conditioned on the input variable, which is given by

\[ y(x) = \mathbb{E}[t \mid x] = \int_{-\infty}^{\infty} t\, p(t \mid x)\, \mathrm{d}t = \frac{\int t\, p(x, t)\, \mathrm{d}t}{\int p(x, t)\, \mathrm{d}t} = \frac{\displaystyle\sum_n \int t\, f(x - x_n, t - t_n)\, \mathrm{d}t}{\displaystyle\sum_m \int f(x - x_m, t - t_m)\, \mathrm{d}t}. \qquad (6.43) \]

We now assume for simplicity that the component density functions have zero mean so that

\[ \int_{-\infty}^{\infty} f(x, t)\, t\, \mathrm{d}t = 0 \qquad (6.44) \]

for all values of x. Using a simple change of variable, we then obtain

\[ y(x) = \frac{\displaystyle\sum_n g(x - x_n)\, t_n}{\displaystyle\sum_m g(x - x_m)} = \sum_n k(x, x_n)\, t_n \qquad (6.45) \]

where n, m = 1, . . . , N and the kernel function k(x, x_n) is given by

\[ k(x, x_n) = \frac{g(x - x_n)}{\displaystyle\sum_m g(x - x_m)} \qquad (6.46) \]

and we have defined

\[ g(x) = \int_{-\infty}^{\infty} f(x, t)\, \mathrm{d}t. \qquad (6.47) \]

The result (6.45) is known as the Nadaraya-Watson model, or kernel regression (Nadaraya, 1964; Watson, 1964). For a localized kernel function, it has the property of giving more weight to the data points x_n that are close to x. Note that the kernel (6.46) satisfies the summation constraint

\[ \sum_{n=1}^{N} k(x, x_n) = 1. \]

Figure 6.3  Illustration of the Nadaraya-Watson kernel regression model using isotropic Gaussian kernels, for the sinusoidal data set. The original sine function is shown by the green curve, the data points are shown in blue, and each is the centre of an isotropic Gaussian kernel. The resulting regression function, given by the conditional mean, is shown by the red line, along with the two-standard-deviation region for the conditional distribution p(t|x) shown by the red shading.
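To connect the result (6.45) to practice, here is a minimal sketch of Nadaraya-Watson regression with isotropic Gaussian kernels, in the spirit of Figure 6.3, assuming NumPy; the bandwidth and the sinusoidal toy data are illustrative choices.

```python
import numpy as np

def nadaraya_watson(x_query, X, t, bandwidth=0.08):
    """Kernel regression (6.45): a weighted average of the targets, where the
    weights k(x, x_n) are normalized Gaussians centred on the training inputs."""
    sq_dist = (x_query[:, None] - X[None, :]) ** 2
    g = np.exp(-0.5 * sq_dist / bandwidth ** 2)   # g(x - x_n), up to a constant factor
    k = g / g.sum(axis=1, keepdims=True)          # rows sum to one, as in (6.46)
    return k @ t

# Illustrative sinusoidal data, loosely in the spirit of Figure 6.3.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(30)

x_grid = np.linspace(0.0, 1.0, 5)
print(nadaraya_watson(x_grid, X, t))
```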