We now modify this approach so that data points are allowed to be on the 'wrong side' of the margin boundary, but with a penalty that increases with the distance from that boundary. For the subsequent optimization problem, it is convenient to make this penalty a linear function of this distance. To do this, we introduce slack variables, $\xi_n \geq 0$ where $n = 1, \ldots, N$, with one slack variable for each training data point (Bennett, 1992; Cortes and Vapnik, 1995). These are defined by $\xi_n = 0$ for data points that are on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ for other points.
Thus a data point that is on the decision boundary $y(\mathbf{x}_n) = 0$ will have $\xi_n = 1$, and points with $\xi_n > 1$ will be misclassified. The exact classification constraints (7.5) are then replaced with
$$
t_n y(\mathbf{x}_n) \geq 1 - \xi_n, \qquad n = 1, \ldots, N \tag{7.20}
$$
in which the slack variables are constrained to satisfy $\xi_n \geq 0$. Data points for which $\xi_n = 0$ are correctly classified and are either on the margin or on the correct side of the margin. Points for which $0 < \xi_n \leq 1$ lie inside the margin, but on the correct side of the decision boundary, and those data points for which $\xi_n > 1$ lie on the wrong side of the decision boundary and are misclassified, as illustrated in Figure 7.3. This is sometimes described as relaxing the hard margin constraint to give a soft margin and allows some of the training set data points to be misclassified.

Figure 7.3  Illustration of the slack variables $\xi_n \geq 0$. Data points with circles around them are support vectors. (The figure shows the contours $y = -1$, $y = 0$, and $y = 1$, together with points having $\xi = 0$, $\xi < 1$, and $\xi > 1$.)
Note that while slack variables allow for overlapping class distributions, this framework is still sensitive to outliers because the penalty for misclassification increases linearly with $\xi$. Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize
$$
C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|\mathbf{w}\|^2 \tag{7.21}
$$
where the parameter $C > 0$ controls the trade-off between the slack variable penalty and the margin.
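The objective (7.21) is straightforward to evaluate numerically once we note that, for targets $t_n \in \{-1, +1\}$, the slack variable definitions above imply $\xi_n = \max(0, 1 - t_n y(\mathbf{x}_n))$. The following is a minimal sketch in Python/NumPy; the toy data, the linear model $y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\mathbf{x} + b$, and the choice $C = 1$ are purely illustrative.

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C=1.0):
    """Evaluate C * sum_n xi_n + 0.5 * ||w||^2, eq. (7.21), for a linear model y(x) = w.x + b.

    The slack xi_n is zero for points on or inside the correct margin
    boundary and |t_n - y(x_n)| otherwise, i.e. max(0, 1 - t_n y(x_n)).
    """
    y = X @ w + b                      # y(x_n) for every training point
    xi = np.maximum(0.0, 1.0 - t * y)  # slack variables xi_n >= 0
    return C * xi.sum() + 0.5 * w @ w

# Hypothetical toy data with targets t_n in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.5])
b = 0.0
print(soft_margin_objective(w, b, X, t, C=1.0))
```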
Because any point that is misclassified has $\xi_n > 1$, it follows that $\sum_n \xi_n$ is an upper bound on the number of misclassified points. The parameter $C$ is therefore analogous to (the inverse of) a regularization coefficient because it controls the trade-off between minimizing training errors and controlling model complexity. In the limit $C \to \infty$, we will recover the earlier support vector machine for separable data.

We now wish to minimize (7.21) subject to the constraints (7.20) together with $\xi_n \geq 0$. The corresponding Lagrangian is given by
$$
L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n \tag{7.22}
$$
where $\{a_n \geq 0\}$ and $\{\mu_n \geq 0\}$ are Lagrange multipliers (Appendix E). The corresponding set of KKT conditions is given by
$$
\begin{aligned}
a_n &\geq 0 &\quad& (7.23)\\
t_n y(\mathbf{x}_n) - 1 + \xi_n &\geq 0 && (7.24)\\
a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} &= 0 && (7.25)\\
\mu_n &\geq 0 && (7.26)\\
\xi_n &\geq 0 && (7.27)\\
\mu_n \xi_n &= 0 && (7.28)
\end{aligned}
$$
where $n = 1, \ldots, N$.

We now optimize out $\mathbf{w}$, $b$, and $\{\xi_n\}$ making use of the definition (7.1) of $y(\mathbf{x})$ to give
$$
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n) \tag{7.29}
$$
$$
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0 \tag{7.30}
$$
$$
\frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n. \tag{7.31}
$$
Using these results to eliminate $\mathbf{w}$, $b$, and $\{\xi_n\}$ from the Lagrangian, we obtain the dual Lagrangian in the form
$$
\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \tag{7.32}
$$
which is identical to the separable case, except that the constraints are somewhat different.
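The dual Lagrangian (7.32) depends on the data only through the kernel (Gram) matrix, and is easy to evaluate for any candidate multiplier vector. The following is a minimal sketch, assuming a precomputed kernel matrix `K` and multipliers `a` supplied by the caller.

```python
import numpy as np

def dual_objective(a, t, K):
    """Dual Lagrangian (7.32): sum_n a_n - 0.5 * sum_{n,m} a_n a_m t_n t_m k(x_n, x_m).

    a : (N,) Lagrange multipliers
    t : (N,) targets in {-1, +1}
    K : (N, N) kernel matrix with K[n, m] = k(x_n, x_m)
    """
    at = a * t                     # elementwise a_n t_n
    return a.sum() - 0.5 * at @ K @ at
```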
To see what these constraints are, we note that $a_n \geq 0$ is required because these are Lagrange multipliers. Furthermore, (7.31) together with $\mu_n \geq 0$ implies $a_n \leq C$. We therefore have to maximize (7.32) with respect to the dual variables $\{a_n\}$ subject to
$$
0 \leq a_n \leq C \tag{7.33}
$$
$$
\sum_{n=1}^{N} a_n t_n = 0 \tag{7.34}
$$
for $n = 1, \ldots, N$, where (7.33) are known as box constraints.
This again represents a quadratic programming problem. If we substitute (7.29) into (7.1), we see that predictions for new data points are again made by using (7.13).

We can now interpret the resulting solution. As before, a subset of the data points may have $a_n = 0$, in which case they do not contribute to the predictive model (7.13). The remaining data points constitute the support vectors. These have $a_n > 0$ and hence from (7.25) must satisfy
$$
t_n y(\mathbf{x}_n) = 1 - \xi_n. \tag{7.35}
$$
If $a_n < C$, then (7.31) implies that $\mu_n > 0$, which from (7.28) requires $\xi_n = 0$ and hence such points lie on the margin. Points with $a_n = C$ can lie inside the margin and can either be correctly classified if $\xi_n \leq 1$ or misclassified if $\xi_n > 1$.

To determine the parameter $b$ in (7.1), we note that those support vectors for which $0 < a_n < C$ have $\xi_n = 0$ so that $t_n y(\mathbf{x}_n) = 1$ and hence will satisfy
$$
t_n \left( \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) + b \right) = 1. \tag{7.36}
$$
Again, a numerically stable solution is obtained by averaging to give
$$
b = \frac{1}{N_{\mathcal{M}}} \sum_{n \in \mathcal{M}} \left( t_n - \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) \right) \tag{7.37}
$$
where $\mathcal{M}$ denotes the set of indices of data points having $0 < a_n < C$.
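One simple way to obtain the multipliers in practice is to hand the dual, with its box constraints, to a generic quadratic programming solver and then recover $b$ from (7.37). The sketch below is illustrative only: it assumes the `cvxopt` package for the QP step (any QP solver would do), and the tolerance `tol` used to decide which $a_n$ are nonzero is a hypothetical practical heuristic.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed available; any QP solver would do

def fit_soft_margin_svm(K, t, C=1.0, tol=1e-6):
    """Maximize the dual (7.32) subject to (7.33) and (7.34), then compute b via (7.37).

    K : (N, N) kernel matrix, t : (N,) targets in {-1, +1}.
    """
    t = np.asarray(t, dtype=float)
    N = len(t)
    P = matrix(np.outer(t, t) * K)                  # quadratic term of -(7.32)
    q = matrix(-np.ones(N))                         # linear term: -sum_n a_n
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))  # encode 0 <= a_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(t.reshape(1, -1))                    # equality constraint sum_n a_n t_n = 0
    sol = solvers.qp(P, q, G, h, A, matrix(0.0))
    a = np.asarray(sol['x']).ravel()

    S = a > tol                     # support vectors: a_n > 0
    M = (a > tol) & (a < C - tol)   # margin support vectors: 0 < a_n < C
    # Bias from (7.37), averaging over the set M for numerical stability.
    b = np.mean(t[M] - (a[S] * t[S]) @ K[np.ix_(S, M)])
    return a, b

def predict(K_new, a, t, b, tol=1e-6):
    """Evaluate y(x) = sum_{n in S} a_n t_n k(x, x_n) + b for new points, as in (7.13)."""
    S = a > tol
    return K_new[:, S] @ (a[S] * np.asarray(t, dtype=float)[S]) + b
```

Here `K_new` denotes the kernel matrix between the new inputs and the training inputs; only the columns corresponding to support vectors contribute, reflecting the sparsity of the solution.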
An alternative, equivalent formulation of the support vector machine, known as the ν-SVM, has been proposed by Schölkopf et al. (2000). This involves maximizing
$$
\widetilde{L}(\mathbf{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \tag{7.38}
$$
subject to the constraints
$$
0 \leq a_n \leq 1/N \tag{7.39}
$$
$$
\sum_{n=1}^{N} a_n t_n = 0 \tag{7.40}
$$
$$
\sum_{n=1}^{N} a_n \geq \nu. \tag{7.41}
$$
This approach has the advantage that the parameter $\nu$, which replaces $C$, can be interpreted as both an upper bound on the fraction of margin errors (points for which $\xi_n > 0$ and hence which lie on the wrong side of the margin boundary and which may or may not be misclassified) and a lower bound on the fraction of support vectors.
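A ν-SVM of this form is available in standard libraries. The following minimal sketch assumes scikit-learn's `NuSVC` with a Gaussian (RBF) kernel on hypothetical two-class data; the values of `nu` and the kernel width `gamma` are illustrative.

```python
import numpy as np
from sklearn.svm import NuSVC  # assumed available

rng = np.random.default_rng(0)
# Hypothetical overlapping (nonseparable) two-class data in two dimensions.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

# nu plays the role described above; gamma = 0.45 matches the kernel width
# quoted for the example of Figure 7.4.
clf = NuSVC(nu=0.25, kernel='rbf', gamma=0.45)
clf.fit(X, t)

frac_sv = len(clf.support_) / len(X)
print("fraction of support vectors:", frac_sv)   # expected to be at least nu
print("predictions:", clf.predict(X[:5]))
```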
An example of the ν-SVM applied to a synthetic data set is shown in Figure 7.4. Here Gaussian kernels of the form $\exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$ have been used, with $\gamma = 0.45$.

Figure 7.4  Illustration of the ν-SVM applied to a nonseparable data set in two dimensions. The support vectors are indicated by circles.
Although predictions for new inputs are made using only the support vectors, the training phase (i.e., the determination of the parameters $\mathbf{a}$ and $b$) makes use of the whole data set, and so it is important to have efficient algorithms for solving the quadratic programming problem. We first note that the objective function $\widetilde{L}(\mathbf{a})$ given by (7.10) or (7.32) is quadratic, and so any local optimum will also be a global optimum provided the constraints define a convex region (which they do as a consequence of being linear). Direct solution of the quadratic programming problem using traditional techniques is often infeasible due to the demanding computation and memory requirements, and so more practical approaches need to be found. The technique of chunking (Vapnik, 1982) exploits the fact that the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero.
This allows the full quadratic programming problem to be broken down into a series of smaller ones, whose goal is eventually to identify all of the nonzero Lagrange multipliers and discard the others. Chunking can be implemented using protected conjugate gradients (Burges, 1998). Although chunking reduces the size of the matrix in the quadratic function from the number of data points squared to approximately the number of nonzero Lagrange multipliers squared, even this may be too big to fit in memory for large-scale applications. Decomposition methods (Osuna et al., 1996) also solve a series of smaller quadratic programming problems but are designed so that each of these is of a fixed size, and so the technique can be applied to arbitrarily large data sets. However, it still involves numerical solution of quadratic programming subproblems and these can be problematic and expensive. One of the most popular approaches to training support vector machines is called sequential minimal optimization, or SMO (Platt, 1999).
It takes the concept of chunking to the extreme limit and considers just two Lagrange multipliers at a time. In this case, the subproblem can be solved analytically, thereby avoiding numerical quadratic programming altogether. Heuristics are given for choosing the pair of Lagrange multipliers to be considered at each step. In practice, SMO is found to have a scaling with the number of data points that is somewhere between linear and quadratic depending on the particular application.
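To make the two-multiplier idea concrete, the following is a minimal sketch of the analytic pair update for the dual (7.32) with box constraints (7.33). It follows a simplified scheme with random pair selection rather than the pair-choice heuristics of Platt (1999), and the sweep count used as a stopping rule is purely illustrative.

```python
import numpy as np

def smo_step(i, j, a, b, t, K, C):
    """Analytically re-optimize the pair (a_i, a_j), holding all other multipliers fixed."""
    if i == j:
        return a, b
    # Prediction errors E_k = y(x_k) - t_k under the current multipliers.
    y = (a * t) @ K + b
    Ei, Ej = y[i] - t[i], y[j] - t[j]
    # Feasible interval [L, H] for a_j implied by 0 <= a_n <= C and sum_n a_n t_n = 0.
    if t[i] != t[j]:
        L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:
        L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the constraint line
    if L >= H or eta <= 0:
        return a, b
    aj_new = np.clip(a[j] + t[j] * (Ei - Ej) / eta, L, H)
    ai_new = a[i] + t[i] * t[j] * (a[j] - aj_new)
    # Re-estimate the bias so the KKT conditions hold for the updated pair.
    b1 = b - Ei - t[i] * (ai_new - a[i]) * K[i, i] - t[j] * (aj_new - a[j]) * K[i, j]
    b2 = b - Ej - t[i] * (ai_new - a[i]) * K[i, j] - t[j] * (aj_new - a[j]) * K[j, j]
    a = a.copy()
    a[i], a[j] = ai_new, aj_new
    if 0 < ai_new < C:
        b = b1
    elif 0 < aj_new < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)
    return a, b

def smo_train(K, t, C=1.0, n_sweeps=50, seed=0):
    """Apply the two-multiplier update to randomly chosen pairs (illustrative stopping rule)."""
    rng = np.random.default_rng(seed)
    N = len(t)
    a, b = np.zeros(N), 0.0
    for _ in range(n_sweeps):
        for i in range(N):
            a, b = smo_step(i, int(rng.integers(N)), a, b, t, K, C)
    return a, b
```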
We have seen that kernel functions correspond to inner products in feature spaces that can have high, or even infinite, dimensionality. By working directly in terms of the kernel function, without introducing the feature space explicitly, it might therefore seem that support vector machines somehow manage to avoid the curse of dimensionality (Section 1.4). This is not the case, however, because there are constraints amongst the feature values that restrict the effective dimensionality of feature space. To see this, consider a simple second-order polynomial kernel that we can expand in terms of its components
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &= \left(1 + \mathbf{x}^{\mathrm{T}}\mathbf{z}\right)^2 = (1 + x_1 z_1 + x_2 z_2)^2 \\
&= 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 \\
&= \left(1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1 x_2, x_2^2\right)\left(1, \sqrt{2}z_1, \sqrt{2}z_2, z_1^2, \sqrt{2}z_1 z_2, z_2^2\right)^{\mathrm{T}} \\
&= \phi(\mathbf{x})^{\mathrm{T}}\phi(\mathbf{z}).
\end{aligned}
\tag{7.42}
$$
This kernel function therefore represents an inner product in a feature space having six dimensions, in which the mapping from input space to feature space is described by the vector function $\phi(\mathbf{x})$.
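The identity (7.42) is easy to verify numerically. The following short sketch, with arbitrary illustrative test vectors, checks that the explicit six-dimensional feature map reproduces the kernel value computed directly in input space.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (1 + x^T z)^2 in two dimensions, as in (7.42)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([0.3, -1.2])   # arbitrary test points
z = np.array([2.0, 0.7])

k_direct  = (1.0 + x @ z) ** 2      # kernel evaluated in input space
k_feature = phi(x) @ phi(z)         # inner product in the 6-D feature space
print(np.isclose(k_direct, k_feature))   # True
```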
However, the coefficients weighting these different features are constrained to have specific forms. Thus any set of points in the original two-dimensional space $\mathbf{x}$ would be constrained to lie exactly on a two-dimensional nonlinear manifold embedded in the six-dimensional feature space.

We have already highlighted the fact that the support vector machine does not provide probabilistic outputs but instead makes classification decisions for new input vectors. Veropoulos et al. (1999) discuss modifications to the SVM to allow the trade-off between false positive and false negative errors to be controlled.
However, if we wish to use the SVM as a module in a larger probabilistic system, then probabilistic predictions of the class label $t$ for new inputs $\mathbf{x}$ are required. To address this issue, Platt (2000) has proposed fitting a logistic sigmoid to the outputs of a previously trained support vector machine. Specifically, the required conditional probability is assumed to be of the form
$$
p(t = 1 \mid \mathbf{x}) = \sigma\left(A\, y(\mathbf{x}) + B\right) \tag{7.43}
$$
where $y(\mathbf{x})$ is defined by (7.1). Values for the parameters $A$ and $B$ are found by minimizing the cross-entropy error function defined by a training set consisting of pairs of values $y(\mathbf{x}_n)$ and $t_n$. The data used to fit the sigmoid needs to be independent of that used to train the original SVM in order to avoid severe over-fitting.
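A minimal sketch of this fitting step is given below, using plain gradient descent on the cross-entropy error with SVM scores $y(\mathbf{x}_n)$ and labels $t_n$ from a held-out set. It is not Platt's exact algorithm (which also uses regularized target values and a more careful optimizer); the toy data, learning rate, and iteration count are hypothetical.

```python
import numpy as np

def fit_platt_sigmoid(y_scores, t, n_iter=2000, lr=1e-3):
    """Fit p(t=1|x) = sigmoid(A * y(x) + B), eq. (7.43), by minimizing cross-entropy.

    y_scores : SVM outputs y(x_n) on data held out from SVM training
    t        : corresponding labels in {-1, +1}
    """
    targets = (t + 1) / 2.0          # map {-1, +1} -> {0, 1}
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A * y_scores + B)))   # sigma(A y + B)
        grad = p - targets            # derivative of cross-entropy w.r.t. (A y + B)
        A -= lr * np.sum(grad * y_scores)
        B -= lr * np.sum(grad)
    return A, B

def predict_proba(y_scores, A, B):
    """Probabilistic outputs (7.43) for new SVM scores."""
    return 1.0 / (1.0 + np.exp(-(A * y_scores + B)))

# Hypothetical held-out scores and labels:
scores = np.array([-2.1, -0.3, 0.2, 1.5, 2.8])
labels = np.array([-1, -1, 1, 1, 1])
A, B = fit_platt_sigmoid(scores, labels)
print(predict_proba(np.array([0.0, 1.0]), A, B))
```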