Bishop, C. M., Pattern Recognition and Machine Learning (2006), Section 6.2: Constructing Kernels
Figure 6.1 Illustration of the construction of kernel functions starting from a corresponding set of basis functions. In each column the lower plot shows the kernel function k(x, x′) defined by (6.10) plotted as a function of x for x′ = 0, while the upper plot shows the corresponding basis functions given by polynomials (left column), 'Gaussians' (centre column), and logistic sigmoids (right column).

If we take the particular case of a two-dimensional input space x = (x1, x2) we can expand out the terms and thereby identify the corresponding nonlinear feature mapping

    k(x, z) = (x^T z)^2 = (x1 z1 + x2 z2)^2
            = x1^2 z1^2 + 2 x1 z1 x2 z2 + x2^2 z2^2
            = (x1^2, √2 x1 x2, x2^2)(z1^2, √2 z1 z2, z2^2)^T
            = φ(x)^T φ(z).                                              (6.12)

We see that the feature mapping takes the form φ(x) = (x1^2, √2 x1 x2, x2^2)^T and therefore comprises all possible second-order terms, with a specific weighting between them.
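As a quick numerical illustration of (6.12), the following sketch (ours, not from the text; NumPy assumed, function names hypothetical) evaluates the kernel directly and via the explicit feature vectors, giving the same value:

```python
import numpy as np

def kernel(x, z):
    """Polynomial kernel k(x, z) = (x^T z)^2 for two-dimensional inputs."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) from (6.12)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])
print(kernel(x, z))                    # direct evaluation of the kernel
print(float(np.dot(phi(x), phi(z))))   # same value via the feature space
```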
More generally, however, we need a simple way to test whether a function constitutes a valid kernel without having to construct the function φ(x) explicitly. A necessary and sufficient condition for a function k(x, x′) to be a valid kernel (Shawe-Taylor and Cristianini, 2004) is that the Gram matrix K, whose elements are given by k(xn, xm), should be positive semidefinite for all possible choices of the set {xn} (Appendix C). Note that a positive semidefinite matrix is not the same thing as a matrix whose elements are nonnegative.
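This condition can be probed numerically for any finite data set by building the Gram matrix and inspecting its eigenvalues. The sketch below is ours (NumPy assumed); note that a single passing check cannot prove validity, because the condition must hold for every possible choice of {xn}, whereas a single failure is enough to rule a function out.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K with elements K[n, m] = kernel(X[n], X[m])."""
    N = len(X)
    return np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])

def is_psd(K, tol=1e-10):
    """True if the symmetric matrix K has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

linear = lambda x, z: float(np.dot(x, z))            # a valid kernel
print(is_psd(gram_matrix(linear, X)))                # True for any choice of X

shifted = lambda x, z: float(np.dot(x, z)) - 1.0     # not a valid kernel
print(is_psd(gram_matrix(shifted, X)))               # fails the check for this data
```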
One powerful technique for constructing new kernels is to build them out of simpler kernels as building blocks. This can be done using the following properties:

Techniques for Constructing New Kernels.
Given valid kernels k1(x, x′) and k2(x, x′), the following new kernels will also be valid:

    k(x, x′) = c k1(x, x′)                                              (6.13)
    k(x, x′) = f(x) k1(x, x′) f(x′)                                     (6.14)
    k(x, x′) = q(k1(x, x′))                                             (6.15)
    k(x, x′) = exp(k1(x, x′))                                           (6.16)
    k(x, x′) = k1(x, x′) + k2(x, x′)                                    (6.17)
    k(x, x′) = k1(x, x′) k2(x, x′)                                      (6.18)
    k(x, x′) = k3(φ(x), φ(x′))                                          (6.19)
    k(x, x′) = x^T A x′                                                 (6.20)
    k(x, x′) = ka(xa, xa′) + kb(xb, xb′)                                (6.21)
    k(x, x′) = ka(xa, xa′) kb(xb, xb′)                                  (6.22)

where c > 0 is a constant, f(·) is any function, q(·) is a polynomial with nonnegative coefficients, φ(x) is a function from x to R^M, k3(·, ·) is a valid kernel in R^M, A is a symmetric positive semidefinite matrix, xa and xb are variables (not necessarily disjoint) with x = (xa, xb), and ka and kb are valid kernel functions over their respective spaces.
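To show how these closure rules compose, here is a self-contained sketch of ours (NumPy assumed; the particular combination is an arbitrary illustration) that builds a new kernel from a linear and a polynomial kernel using (6.13), (6.16), (6.17) and (6.18), and then checks positive semidefiniteness on a random data set:

```python
import numpy as np

def k1(x, z):
    return float(np.dot(x, z))                  # linear kernel, valid

def k2(x, z):
    return (float(np.dot(x, z)) + 1.0) ** 2     # polynomial kernel, valid

def combined(x, z):
    # (6.13) scaling, (6.16) exponential, (6.18) product, (6.17) sum of valid kernels
    return k1(x, z) + k2(x, z) * np.exp(0.1 * k1(x, z))

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))
K = np.array([[combined(x, z) for z in X] for x in X])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # expected: True
```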
Equipped with these properties, we can now embark on the construction of more complex kernels appropriate to specific applications. We require that the kernel k(x, x′) be symmetric and positive semidefinite and that it expresses the appropriate form of similarity between x and x′ according to the intended application. Here we consider a few common examples of kernel functions. For a more extensive discussion of 'kernel engineering', see Shawe-Taylor and Cristianini (2004).

We saw that the simple polynomial kernel k(x, x′) = (x^T x′)^2 contains only terms of degree two. If we consider the slightly generalized kernel k(x, x′) = (x^T x′ + c)^2 with c > 0, then the corresponding feature mapping φ(x) contains constant and linear terms as well as terms of order two.
Similarly, k(x, x′) = (x^T x′)^M contains all monomials of order M. For instance, if x and x′ are two images, then the kernel represents a particular weighted sum of all possible products of M pixels in the first image with M pixels in the second image. This can similarly be generalized to include all terms up to degree M by considering k(x, x′) = (x^T x′ + c)^M with c > 0. Using the results (6.17) and (6.18) for combining kernels we see that these will all be valid kernel functions.
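The practical appeal is that the kernel is evaluated in time linear in the input dimension, while the implicit feature space of monomials grows combinatorially with M. A small sketch of ours (NumPy assumed; the feature-space count uses the standard binomial-coefficient formula for monomials of degree at most M in D variables, which is not stated in the text):

```python
import numpy as np
from math import comb

def poly_kernel(x, z, c=1.0, M=3):
    """k(x, z) = (x^T z + c)^M: all monomials up to degree M, handled implicitly."""
    return (float(np.dot(x, z)) + c) ** M

D, M = 256, 3                          # e.g. a 16x16 image flattened to 256 pixels
rng = np.random.default_rng(2)
x, z = rng.normal(size=D), rng.normal(size=D)
print(poly_kernel(x, z, c=1.0, M=M))   # one O(D) dot product, then a power

# Dimension of the explicit feature space (monomials of degree <= M in D variables):
print(comb(D + M, M))                  # 2862209 features that are never constructed
```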
Another commonly used kernel takes the form

    k(x, x′) = exp(−‖x − x′‖^2 / 2σ^2)                                  (6.23)

and is often called a 'Gaussian' kernel. Note, however, that in this context it is not interpreted as a probability density, and hence the normalization coefficient is omitted. We can see that this is a valid kernel by expanding the square

    ‖x − x′‖^2 = x^T x + (x′)^T x′ − 2 x^T x′                           (6.24)

to give

    k(x, x′) = exp(−x^T x / 2σ^2) exp(x^T x′ / σ^2) exp(−(x′)^T x′ / 2σ^2)   (6.25)

and then making use of (6.14) and (6.16), together with the validity of the linear kernel k(x, x′) = x^T x′. Note that the feature vector that corresponds to the Gaussian kernel has infinite dimensionality (Exercise 6.11).
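As a sanity check on the factorization (6.25), the following sketch of ours (NumPy assumed) evaluates the Gaussian kernel directly from (6.23) and as the product of the three exponential factors:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.5):
    """Direct evaluation of (6.23)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def gaussian_kernel_factored(x, z, sigma=1.5):
    """Evaluation via the three factors in (6.25)."""
    s2 = sigma ** 2
    return (np.exp(-np.dot(x, x) / (2.0 * s2))
            * np.exp(np.dot(x, z) / s2)
            * np.exp(-np.dot(z, z) / (2.0 * s2)))

rng = np.random.default_rng(4)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(gaussian_kernel(x, z), gaussian_kernel_factored(x, z)))  # True
```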
The Gaussian kernel is not restricted to the use of Euclidean distance. If we use kernel substitution in (6.24) to replace x^T x′ with a nonlinear kernel κ(x, x′), we obtain

    k(x, x′) = exp(−(1/2σ^2)(κ(x, x) + κ(x′, x′) − 2κ(x, x′))).         (6.26)
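In other words, any valid base kernel κ can be turned into a Gaussian-like kernel through (6.26). The sketch below is ours (NumPy assumed; the polynomial base kernel is an arbitrary choice) and checks the resulting Gram matrix numerically:

```python
import numpy as np

def kappa(x, z):
    """Base kernel: a polynomial kernel, itself valid."""
    return (float(np.dot(x, z)) + 1.0) ** 2

def kernelized_gaussian(x, z, sigma=1.0):
    """Gaussian-like kernel built from kappa via (6.26)."""
    d2 = kappa(x, x) + kappa(z, z) - 2.0 * kappa(x, z)   # squared distance in kappa's feature space
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.normal(size=(12, 3))
K = np.array([[kernelized_gaussian(x, z) for z in X] for x in X])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))           # expected: True
```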
An important contribution to arise from the kernel viewpoint has been the extension to inputs that are symbolic, rather than simply vectors of real numbers. Kernel functions can be defined over objects as diverse as graphs, sets, strings, and text documents. Consider, for instance, a fixed set and define a nonvectorial space consisting of all possible subsets of this set. If A1 and A2 are two such subsets then one simple choice of kernel would be

    k(A1, A2) = 2^|A1 ∩ A2|                                             (6.27)

where A1 ∩ A2 denotes the intersection of sets A1 and A2, and |A| denotes the number of elements in A. This is a valid kernel function because it can be shown to correspond to an inner product in a feature space (Exercise 6.12).
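A direct implementation of (6.27) needs nothing beyond built-in Python sets; the sketch below (ours; NumPy used only for the eigenvalue check) also verifies positive semidefiniteness on a handful of subsets:

```python
import numpy as np

def subset_kernel(A1, A2):
    """k(A1, A2) = 2^{|A1 intersect A2|}, as in (6.27)."""
    return 2.0 ** len(A1 & A2)

subsets = [set(), {1}, {1, 2}, {2, 3}, {1, 2, 3, 4}]
K = np.array([[subset_kernel(a, b) for b in subsets] for a in subsets])
print(K)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # expected: True
```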
One powerful approach to the construction of kernels starts from a probabilistic generative model (Haussler, 1999), which allows us to apply generative models in a discriminative setting. Generative models can deal naturally with missing data and in the case of hidden Markov models can handle sequences of varying length. By contrast, discriminative models generally give better performance on discriminative tasks than generative models. It is therefore of some interest to combine these two approaches (Lasserre et al., 2006). One way to combine them is to use a generative model to define a kernel, and then use this kernel in a discriminative approach.

Given a generative model p(x) we can define a kernel by

    k(x, x′) = p(x) p(x′).                                              (6.28)

This is clearly a valid kernel function because we can interpret it as an inner product in the one-dimensional feature space defined by the mapping p(x). It says that two inputs x and x′ are similar if they both have high probabilities.
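As a minimal illustration of (6.28) (ours; the one-dimensional Gaussian density and its parameters are arbitrary stand-ins for a fitted generative model, NumPy assumed):

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """A simple 1-D Gaussian density, standing in for any generative model p(x)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def generative_kernel(x, xp, p=gaussian_pdf):
    """k(x, x') = p(x) p(x'), as in (6.28)."""
    return p(x) * p(xp)

print(generative_kernel(0.1, -0.2))   # both inputs probable under p: larger value
print(generative_kernel(0.1, 4.0))    # one improbable input: much smaller value
```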
We can use (6.13) and (6.17) to extend this class of kernels by considering sums over products of different probability distributions, with positive weighting coefficients p(i), of the form

    k(x, x′) = ∑_i p(x|i) p(x′|i) p(i).                                 (6.29)

This is equivalent, up to an overall multiplicative constant, to a mixture distribution (Section 9.2) in which the components factorize, with the index i playing the role of a 'latent' variable. Two inputs x and x′ will give a large value for the kernel function, and hence appear similar, if they have significant probability under a range of different components.
Taking the limit of an infinite sum, we can also consider kernels of the form

    k(x, x′) = ∫ p(x|z) p(x′|z) p(z) dz                                 (6.30)

where z is a continuous latent variable.
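A concrete instance of (6.29) can be written down directly; the sketch below is ours (NumPy assumed; the two-component Gaussian mixture and its parameters are arbitrary illustrative values):

```python
import numpy as np

# Mixture components p(x|i) and weights p(i); values chosen only for illustration.
means, sigmas, weights = [-2.0, 2.0], [1.0, 0.5], [0.3, 0.7]

def component_pdf(x, i):
    m, s = means[i], sigmas[i]
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

def mixture_kernel(x, xp):
    """k(x, x') = sum_i p(x|i) p(x'|i) p(i), as in (6.29)."""
    return sum(component_pdf(x, i) * component_pdf(xp, i) * weights[i]
               for i in range(len(weights)))

print(mixture_kernel(1.8, 2.1))     # both inputs probable under the same component
print(mixture_kernel(-2.0, 2.0))    # probable only under different components: smaller value
```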
Now suppose that our data consists of ordered sequences of length L so that an observation is given by X = {x1, . . . , xL}. A popular generative model for sequences is the hidden Markov model (Section 13.2), which expresses the distribution p(X) as a marginalization over a corresponding sequence of hidden states Z = {z1, . . . , zL}. We can use this approach to define a kernel function measuring the similarity of two sequences X and X′ by extending the mixture representation (6.29) to give

    k(X, X′) = ∑_Z p(X|Z) p(X′|Z) p(Z)                                  (6.31)

so that both observed sequences are generated by the same hidden sequence Z. This model can easily be extended to allow sequences of differing length to be compared.
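For a toy hidden Markov model with a small state space and short sequences, (6.31) can be evaluated by brute-force enumeration of the hidden sequences. The sketch below is ours (NumPy assumed; the transition, emission and initial-state tables are made-up toy values) and is meant only to make the sum concrete; a practical implementation would use a forward-style recursion instead.

```python
import numpy as np
from itertools import product

# Toy HMM: 2 hidden states, 3 discrete observation symbols (arbitrary values).
pi = np.array([0.6, 0.4])                         # p(z_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])            # p(z_l | z_{l-1})
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # p(x_l | z_l)

def p_obs_given_hidden(X, Z):
    """p(X|Z) = prod_l p(x_l | z_l)."""
    return np.prod([B[z, x] for x, z in zip(X, Z)])

def p_hidden(Z):
    """p(Z) = p(z_1) prod_l p(z_l | z_{l-1})."""
    prob = pi[Z[0]]
    for prev, curr in zip(Z[:-1], Z[1:]):
        prob *= A[prev, curr]
    return prob

def hmm_kernel(X1, X2):
    """k(X, X') = sum_Z p(X|Z) p(X'|Z) p(Z), as in (6.31); equal-length sequences."""
    L = len(X1)
    return sum(p_obs_given_hidden(X1, Z) * p_obs_given_hidden(X2, Z) * p_hidden(Z)
               for Z in product(range(len(pi)), repeat=L))

print(hmm_kernel([0, 1, 2], [0, 1, 2]))   # a sequence compared with itself
print(hmm_kernel([0, 1, 2], [2, 2, 0]))   # two different sequences
```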
An alternative technique for using generative models to define kernel functions is known as the Fisher kernel (Jaakkola and Haussler, 1999). Consider a parametric generative model p(x|θ) where θ denotes the vector of parameters. The goal is to find a kernel that measures the similarity of two input vectors x and x′ induced by the generative model. Jaakkola and Haussler (1999) consider the gradient with respect to θ, which defines a vector in a 'feature' space having the same dimensionality as θ.
In particular, they consider the Fisher score

    g(θ, x) = ∇_θ ln p(x|θ)                                             (6.32)

from which the Fisher kernel is defined by

    k(x, x′) = g(θ, x)^T F^{-1} g(θ, x′).                               (6.33)

Here F is the Fisher information matrix, given by

    F = E_x[ g(θ, x) g(θ, x)^T ]                                        (6.34)

where the expectation is with respect to x under the distribution p(x|θ). This can be motivated from the perspective of information geometry (Amari, 1998), which considers the differential geometry of the space of model parameters. Here we simply note that the presence of the Fisher information matrix causes this kernel to be invariant under a nonlinear re-parameterization of the density model θ → ψ(θ) (Exercise 6.13).

In practice, it is often infeasible to evaluate the Fisher information matrix. One approach is simply to replace the expectation in the definition of the Fisher information with the sample average, giving

    F ≃ (1/N) ∑_{n=1}^{N} g(θ, xn) g(θ, xn)^T.                          (6.35)
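To make (6.32)-(6.35) concrete, here is a small sketch of our own (NumPy assumed) for a univariate Gaussian model p(x|θ) = N(x|μ, σ^2) with θ = (μ, σ). The closed-form score expressions (x − μ)/σ^2 and ((x − μ)^2 − σ^2)/σ^3 are standard results for this model and are not taken from the text.

```python
import numpy as np

def fisher_score(x, mu, sigma):
    """g(theta, x) = grad_theta ln N(x | mu, sigma^2) with theta = (mu, sigma);  (6.32)."""
    return np.array([(x - mu) / sigma ** 2,
                     ((x - mu) ** 2 - sigma ** 2) / sigma ** 3])

def fisher_kernel(x, xp, data, mu, sigma):
    """k(x, x') = g(x)^T F^{-1} g(x'), with F estimated by the sample average (6.35)."""
    G = np.stack([fisher_score(xn, mu, sigma) for xn in data])   # N x dim(theta) score matrix
    F = G.T @ G / len(data)                                      # (6.35)
    gx, gxp = fisher_score(x, mu, sigma), fisher_score(xp, mu, sigma)
    return float(gx @ np.linalg.solve(F, gxp))                   # (6.33)

rng = np.random.default_rng(6)
data = rng.normal(loc=1.0, scale=2.0, size=500)
mu, sigma = data.mean(), data.std()            # fit the generative model to the data
print(fisher_kernel(0.5, 1.5, data, mu, sigma))
print(fisher_kernel(0.5, 8.0, data, mu, sigma))
```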