Figure 7.7  Illustration of SVM regression, showing the regression curve y(x) together with the ε-insensitive 'tube'. Also shown are examples of the slack variables ξ and ξ̂. Points above the ε-tube have ξ > 0 and ξ̂ = 0, points below the ε-tube have ξ = 0 and ξ̂ > 0, and points inside the ε-tube have ξ = ξ̂ = 0.

The error function for support vector regression can then be written as

    C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|\mathbf{w}\|^2        (7.55)

which must be minimized subject to the constraints ξ_n ≥ 0 and ξ̂_n ≥ 0 as well as (7.53) and (7.54). This can be achieved by introducing Lagrange multipliers a_n ≥ 0, â_n ≥ 0, µ_n ≥ 0, and µ̂_n ≥ 0 and optimizing the Lagrangian

    L = C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|\mathbf{w}\|^2
        - \sum_{n=1}^{N} (\mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n)
        - \sum_{n=1}^{N} a_n (\epsilon + \xi_n + y_n - t_n)
        - \sum_{n=1}^{N} \hat{a}_n (\epsilon + \hat{\xi}_n - y_n + t_n).        (7.56)
We now substitute for y(x) using (7.1) and then set the derivatives of the Lagrangian with respect to w, b, ξ_n, and ξ̂_n to zero, giving

    \frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} (a_n - \hat{a}_n) \phi(\mathbf{x}_n)        (7.57)

    \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} (a_n - \hat{a}_n) = 0        (7.58)

    \frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n + \mu_n = C        (7.59)

    \frac{\partial L}{\partial \hat{\xi}_n} = 0 \;\Rightarrow\; \hat{a}_n + \hat{\mu}_n = C.        (7.60)

Using these results to eliminate the corresponding variables from the Lagrangian (Exercise 7.7), we see that the dual problem involves maximizing

    \widetilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} (a_n - \hat{a}_n)(a_m - \hat{a}_m) k(\mathbf{x}_n, \mathbf{x}_m)
        - \epsilon \sum_{n=1}^{N} (a_n + \hat{a}_n) + \sum_{n=1}^{N} (a_n - \hat{a}_n) t_n        (7.61)

with respect to {a_n} and {â_n}, where we have introduced the kernel k(x, x') = φ(x)ᵀφ(x'). Again, this is a constrained maximization, and to find the constraints we note that a_n ≥ 0 and â_n ≥ 0 are both required because these are Lagrange multipliers.
Also, µ_n ≥ 0 and µ̂_n ≥ 0 together with (7.59) and (7.60) require a_n ≤ C and â_n ≤ C, and so again we have the box constraints

    0 \leq a_n \leq C        (7.62)

    0 \leq \hat{a}_n \leq C        (7.63)

together with the condition (7.58).
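As a concrete but purely illustrative sketch (not taken from the book), the dual problem (7.61) with the box constraints (7.62)–(7.63) and the condition (7.58) can be solved for small data sets with a general-purpose optimizer. The example below assumes Python with NumPy and SciPy, a Gaussian kernel, and illustrative names (C, eps, gamma); for realistic problem sizes a dedicated quadratic programming or SMO-style solver would be used instead.

    import numpy as np
    from scipy.optimize import minimize

    def gaussian_kernel(X1, X2, gamma=1.0):
        # k(x, x') = exp(-gamma * ||x - x'||^2)
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def fit_svr_dual(X, t, C=1.0, eps=0.1, gamma=1.0):
        # Maximize the dual (7.61) over a_n and a_hat_n, subject to the box
        # constraints (7.62)-(7.63) and the equality constraint (7.58),
        # by minimizing the negative of (7.61).
        N = len(t)
        K = gaussian_kernel(X, X, gamma)

        def neg_dual(z):
            a, a_hat = z[:N], z[N:]
            d = a - a_hat
            return 0.5 * d @ K @ d + eps * np.sum(a + a_hat) - d @ t

        constraints = [{'type': 'eq', 'fun': lambda z: np.sum(z[:N] - z[N:])}]  # (7.58)
        bounds = [(0.0, C)] * (2 * N)                                           # (7.62), (7.63)
        res = minimize(neg_dual, np.zeros(2 * N), method='SLSQP',
                       bounds=bounds, constraints=constraints)
        a, a_hat = res.x[:N], res.x[N:]
        return a, a_hat, K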
Substituting (7.57) into (7.1), we see that predictions for new inputs can be made using

    y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(\mathbf{x}, \mathbf{x}_n) + b        (7.64)

which is again expressed in terms of the kernel function.

The corresponding Karush-Kuhn-Tucker (KKT) conditions, which state that at the solution the product of the dual variables and the constraints must vanish, are given by

    a_n (\epsilon + \xi_n + y_n - t_n) = 0        (7.65)

    \hat{a}_n (\epsilon + \hat{\xi}_n - y_n + t_n) = 0        (7.66)

    (C - a_n) \xi_n = 0        (7.67)

    (C - \hat{a}_n) \hat{\xi}_n = 0.        (7.68)

From these we can obtain several useful results. First of all, we note that a coefficient a_n can only be nonzero if ε + ξ_n + y_n − t_n = 0, which implies that the data point either lies on the upper boundary of the ε-tube (ξ_n = 0) or lies above the upper boundary (ξ_n > 0). Similarly, a nonzero value for â_n implies ε + ξ̂_n − y_n + t_n = 0, and such points must lie either on or below the lower boundary of the ε-tube. Furthermore, the two constraints ε + ξ_n + y_n − t_n = 0 and ε + ξ̂_n − y_n + t_n = 0 are incompatible, as is easily seen by adding them together and noting that ξ_n and ξ̂_n are nonnegative while ε is strictly positive, and so for every data point x_n, either a_n or â_n (or both) must be zero.

The support vectors are those data points that contribute to predictions given by (7.64), in other words those for which either a_n ≠ 0 or â_n ≠ 0.
These are points that lie on the boundary of the ε-tube or outside the tube. All points within the tube have a_n = â_n = 0. We again have a sparse solution, and the only terms that have to be evaluated in the predictive model (7.64) are those that involve the support vectors.

The parameter b can be found by considering a data point for which 0 < a_n < C, which from (7.67) must have ξ_n = 0, and from (7.65) must therefore satisfy ε + y_n − t_n = 0.
Using (7.1) and solving for b, we obtain

    b = t_n - \epsilon - \mathbf{w}^{\mathrm{T}} \phi(\mathbf{x}_n)
      = t_n - \epsilon - \sum_{m=1}^{N} (a_m - \hat{a}_m) k(\mathbf{x}_n, \mathbf{x}_m)        (7.69)

where we have used (7.57). We can obtain an analogous result by considering a point for which 0 < â_n < C. In practice, it is better to average over all such estimates of b.
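Continuing the same illustrative sketch (again an assumption about implementation, not the book's own code), b can be estimated by averaging (7.69) over the points with 0 < a_n < C, together with the analogous estimates (from (7.66)) for points with 0 < â_n < C, after which predictions follow directly from (7.64):

    import numpy as np  # gaussian_kernel and fit_svr_dual as in the earlier sketch

    def compute_b(a, a_hat, K, t, C, eps, tol=1e-6):
        # Average the estimates (7.69) over points lying strictly inside the box constraints.
        d = a - a_hat
        wphi = K @ d                                   # w^T phi(x_n) at each training point
        upper = (a > tol) & (a < C - tol)              # points on the upper tube boundary
        lower = (a_hat > tol) & (a_hat < C - tol)      # points on the lower tube boundary
        b_estimates = np.concatenate([t[upper] - eps - wphi[upper],
                                      t[lower] + eps - wphi[lower]])
        return b_estimates.mean()

    def predict(X_new, X_train, a, a_hat, b, gamma=1.0):
        # Prediction (7.64): y(x) = sum_n (a_n - a_hat_n) k(x, x_n) + b
        return gaussian_kernel(X_new, X_train, gamma) @ (a - a_hat) + b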
As with the classification case, there is an alternative formulation of the SVM for regression in which the parameter governing complexity has a more intuitive interpretation (Schölkopf et al., 2000). In particular, instead of fixing the width ε of the insensitive region, we fix instead a parameter ν that bounds the fraction of points lying outside the tube. This involves maximizing

    \widetilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} (a_n - \hat{a}_n)(a_m - \hat{a}_m) k(\mathbf{x}_n, \mathbf{x}_m)
        + \sum_{n=1}^{N} (a_n - \hat{a}_n) t_n        (7.70)

subject to the constraints

    0 \leq a_n \leq C/N        (7.71)

    0 \leq \hat{a}_n \leq C/N        (7.72)

    \sum_{n=1}^{N} (a_n - \hat{a}_n) = 0        (7.73)

    \sum_{n=1}^{N} (a_n + \hat{a}_n) \leq \nu C.        (7.74)

It can be shown (Appendix A) that there are at most νN data points falling outside the insensitive tube, while at least νN data points are support vectors and so lie either on the tube or outside it.

The use of a support vector machine to solve a regression problem is illustrated using the sinusoidal data set in Figure 7.8. Here the parameters ν and C have been chosen by hand. In practice, their values would typically be determined by cross-validation.
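As a practical aside (not part of the book's presentation), ν-SVM regression is available in standard software; for instance, scikit-learn provides a NuSVR estimator. The short sketch below fits a noisy sinusoid in the spirit of Figure 7.8; the data set, parameter values, and the library's exact parameterization of ν and C (which may differ in detail from (7.70)–(7.74)) are illustrative assumptions.

    import numpy as np
    from sklearn.svm import NuSVR

    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(50, 1))                            # inputs in [0, 1]
    t = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.standard_normal(50)  # noisy sinusoid

    # nu bounds the fraction of points lying outside the tube; C and the
    # Gaussian-kernel width gamma would normally be chosen by cross-validation.
    model = NuSVR(nu=0.2, C=10.0, kernel='rbf', gamma=10.0)
    model.fit(X, t)

    print("number of support vectors:", len(model.support_))
    X_grid = np.linspace(0.0, 1.0, 200)[:, None]
    y_grid = model.predict(X_grid)                                     # predicted regression curve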
Figure 7.8  Illustration of the ν-SVM for regression applied to the sinusoidal synthetic data set using Gaussian kernels. The predicted regression curve is shown by the red line, and the ε-insensitive tube corresponds to the shaded region. Also, the data points are shown in green, and those corresponding to support vectors are indicated by blue circles.

7.1.5 Computational learning theory

Historically, support vector machines have largely been motivated and analysed using a theoretical framework known as computational learning theory, also sometimes called statistical learning theory (Anthony and Biggs, 1992; Kearns and Vazirani, 1994; Vapnik, 1995; Vapnik, 1998). This has its origins with Valiant (1984), who formulated the probably approximately correct, or PAC, learning framework. The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization.
It also gives bounds for the computational cost of learning, although we do not consider these here.

Suppose that a data set D of size N is drawn from some joint distribution p(x, t) where x is the input variable and t represents the class label, and that we restrict attention to 'noise free' situations in which the class labels are determined by some (unknown) deterministic function t = g(x).
In PAC learning we say that a function f(x; D), drawn from a space F of such functions on the basis of the training set D, has good generalization if its expected error rate is below some pre-specified threshold ε, so that

    \mathbb{E}_{\mathbf{x}, t}\big[ I\big(f(\mathbf{x}; \mathcal{D}) \neq t\big) \big] < \epsilon        (7.75)

where I(·) is the indicator function, and the expectation is with respect to the distribution p(x, t).
The quantity on the left-hand side is a random variable, because it depends on the training set D, and the PAC framework requires that (7.75) holds, with probability greater than 1 − δ, for a data set D drawn randomly from p(x, t). Here δ is another pre-specified parameter, and the terminology 'probably approximately correct' comes from the requirement that with high probability (greater than 1 − δ), the error rate be small (less than ε).
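Putting the two parameters together, this requirement can be summarized (as a restatement of the definition above, not an additional result) as

    \Pr_{\mathcal{D} \sim p(\mathbf{x}, t)} \Big[ \, \mathbb{E}_{\mathbf{x}, t}\big[ I\big(f(\mathbf{x}; \mathcal{D}) \neq t\big) \big] < \epsilon \, \Big] \;>\; 1 - \delta

where the outer probability is taken over the random draw of the data set D of size N.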
For a given choice of model space F, and for given parameters ε and δ, PAC learning aims to provide bounds on the minimum size N of data set needed to meet this criterion. A key quantity in PAC learning is the Vapnik-Chervonenkis dimension, or VC dimension, which provides a measure of the complexity of a space of functions, and which allows the PAC framework to be extended to spaces containing an infinite number of functions.

The bounds derived within the PAC framework are often described as worst-case, because they apply to any choice for the distribution p(x, t), so long as both the training and the test examples are drawn (independently) from the same distribution, and for any choice for the function f(x) so long as it belongs to F.
In real-world applications of machine learning, we deal with distributions that have significant regularity, for example in which large regions of input space carry the same class label. As a consequence of the lack of any assumptions about the form of the distribution, the PAC bounds are very conservative, in other words they strongly over-estimate the size of data sets required to achieve a given generalization performance. For this reason, PAC bounds have found few, if any, practical applications.

One attempt to improve the tightness of the PAC bounds is the PAC-Bayesian framework (McAllester, 2003), which considers a distribution over the space F of functions, somewhat analogous to the prior in a Bayesian treatment.