Bishop C.M., Pattern Recognition and Machine Learning (2006), page 76
We can use this freedom to set
$$t_n\bigl(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) + b\bigr) = 1 \tag{7.4}$$
for the point that is closest to the surface. In this case, all data points will satisfy the constraints
$$t_n\bigl(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) + b\bigr) \geqslant 1, \qquad n = 1, \ldots, N. \tag{7.5}$$
This is known as the canonical representation of the decision hyperplane. In the case of data points for which the equality holds, the constraints are said to be active, whereas for the remainder they are said to be inactive.
By definition, there will always be at least one active constraint, because there will always be a closest point, and once the margin has been maximized there will be at least two active constraints. The optimization problem then simply requires that we maximize $\|\mathbf{w}\|^{-1}$, which is equivalent to minimizing $\|\mathbf{w}\|^2$, and so we have to solve the optimization problem
$$\arg\min_{\mathbf{w},\, b}\; \frac{1}{2}\|\mathbf{w}\|^2 \tag{7.6}$$
subject to the constraints given by (7.5). The factor of 1/2 in (7.6) is included for later convenience.
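As an aside (not part of the original text), for a small, linearly separable data set the primal problem (7.6) can be handed directly to a general-purpose constrained optimizer. The sketch below uses SciPy's SLSQP method; the function and variable names (`fit_primal_svm`, `Phi`, `t`) are illustrative assumptions rather than anything prescribed by the book.

```python
# Minimal sketch: solve (7.6) subject to the canonical constraints (7.5)
# with a general-purpose constrained optimizer (small problems only).
import numpy as np
from scipy.optimize import minimize

def fit_primal_svm(Phi, t):
    """Minimize (1/2)||w||^2 subject to t_n (w^T phi_n + b) >= 1."""
    N, M = Phi.shape

    def objective(params):
        w = params[:M]
        return 0.5 * np.dot(w, w)

    def constraints_fun(params):
        w, b = params[:M], params[M]
        # Each component must be >= 0: the canonical constraints (7.5).
        return t * (Phi @ w + b) - 1.0

    res = minimize(objective,
                   x0=np.zeros(M + 1),
                   method="SLSQP",
                   constraints=[{"type": "ineq", "fun": constraints_fun}])
    return res.x[:M], res.x[M]

# Toy usage: two well-separated clusters in 2-D, identity feature map.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (10, 2)), rng.normal(+2, 0.5, (10, 2))])
t = np.array([-1] * 10 + [+1] * 10)
w, b = fit_primal_svm(X, t)
print(w, b)
```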
This is an example of a quadratic programming problem in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints. It appears that the bias parameter $b$ has disappeared from the optimization. However, it is determined implicitly via the constraints, because these require that changes to $\mathbf{w}$ be compensated by changes to $b$. We shall see how this works shortly.

In order to solve this constrained optimization problem, we introduce Lagrange multipliers $a_n \geqslant 0$ (Appendix E), with one multiplier $a_n$ for each of the constraints in (7.5), giving the Lagrangian function
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \bigl\{ t_n\bigl(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) + b\bigr) - 1 \bigr\} \tag{7.7}$$
where $\mathbf{a} = (a_1, \ldots, a_N)^{\mathrm{T}}$. Note the minus sign in front of the Lagrange multiplier term, because we are minimizing with respect to $\mathbf{w}$ and $b$, and maximizing with respect to $\mathbf{a}$. Setting the derivatives of $L(\mathbf{w}, b, \mathbf{a})$ with respect to $\mathbf{w}$ and $b$ equal to zero, we obtain the following two conditions
$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \boldsymbol{\phi}(\mathbf{x}_n) \tag{7.8}$$
$$0 = \sum_{n=1}^{N} a_n t_n. \tag{7.9}$$
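For readers who wish to see the intermediate algebra (this worked step is an addition, not part of Bishop's text), substituting (7.8) into (7.7) and using (7.9) to cancel the term involving $b$ gives
$$
\begin{aligned}
L &= \tfrac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}
   - \sum_{n} a_n t_n \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)
   - b \sum_{n} a_n t_n + \sum_{n} a_n \\
  &= \tfrac{1}{2}\sum_{n,m} a_n a_m t_n t_m \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_m)
   - \sum_{n,m} a_n a_m t_n t_m \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_m)
   + \sum_{n} a_n \\
  &= \sum_{n} a_n - \tfrac{1}{2}\sum_{n,m} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m),
\end{aligned}
$$
which is exactly the dual objective (7.10) stated below.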
Eliminating $\mathbf{w}$ and $b$ from $L(\mathbf{w}, b, \mathbf{a})$ using these conditions then gives the dual representation of the maximum margin problem in which we maximize
$$\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \tag{7.10}$$
with respect to $\mathbf{a}$ subject to the constraints
$$a_n \geqslant 0, \qquad n = 1, \ldots, N, \tag{7.11}$$
$$\sum_{n=1}^{N} a_n t_n = 0. \tag{7.12}$$
Here the kernel function is defined by $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}')$. Again, this takes the form of a quadratic programming problem in which we optimize a quadratic function of $\mathbf{a}$ subject to a set of inequality constraints. We shall discuss techniques for solving such quadratic programming problems in Section 7.1.1.

The solution to a quadratic programming problem in $M$ variables in general has computational complexity that is $O(M^3)$.
In going to the dual formulation we have turned the original optimization problem, which involved minimizing (7.6) over $M$ variables, into the dual problem (7.10), which has $N$ variables. For a fixed set of basis functions whose number $M$ is smaller than the number $N$ of data points, the move to the dual problem appears disadvantageous. However, it allows the model to be reformulated using kernels, and so the maximum margin classifier can be applied efficiently to feature spaces whose dimensionality exceeds the number of data points, including infinite feature spaces. The kernel formulation also makes clear the role of the constraint that the kernel function $k(\mathbf{x}, \mathbf{x}')$ be positive definite, because this ensures that the Lagrangian function $\widetilde{L}(\mathbf{a})$ is bounded below, giving rise to a well-defined optimization problem.
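To make the dual formulation concrete, here is a minimal sketch (an addition, not the book's algorithm) that solves (7.10)-(7.12) numerically with a general-purpose solver and a Gaussian kernel. In practice a dedicated QP or SMO routine would be used, and the helper names (`gaussian_kernel`, `fit_dual_svm`) are illustrative assumptions.

```python
# Minimal sketch: maximize the dual (7.10) subject to (7.11) and (7.12)
# by minimizing its negative with a general-purpose constrained optimizer.
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), as in (6.23)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_dual_svm(X, t, kernel=gaussian_kernel):
    N = len(t)
    K = kernel(X, X)
    Q = (t[:, None] * t[None, :]) * K          # Q_nm = t_n t_m k(x_n, x_m)

    def neg_dual(a):                           # negative of (7.10)
        return 0.5 * a @ Q @ a - a.sum()

    constraints = [{"type": "eq", "fun": lambda a: a @ t}]   # (7.12)
    bounds = [(0.0, None)] * N                               # (7.11)
    res = minimize(neg_dual, np.zeros(N), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x                                             # the multipliers a_n
```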
In order to classify new data points using the trained model, we evaluate the sign of $y(\mathbf{x})$ defined by (7.1). This can be expressed in terms of the parameters $\{a_n\}$ and the kernel function by substituting for $\mathbf{w}$ using (7.8) to give
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b. \tag{7.13}$$

[Biographical note] Joseph-Louis Lagrange (1736-1813). Although widely considered to be a French mathematician, Lagrange was born in Turin in Italy. By the age of nineteen, he had already made important contributions to mathematics and had been appointed as Professor at the Royal Artillery School in Turin. For many years, Euler worked hard to persuade Lagrange to move to Berlin, which he eventually did in 1766, where he succeeded Euler as Director of Mathematics at the Berlin Academy. Later he moved to Paris, narrowly escaping with his life during the French revolution thanks to the personal intervention of Lavoisier (the French chemist who discovered oxygen), who himself was later executed at the guillotine. Lagrange made key contributions to the calculus of variations and the foundations of dynamics.
In Appendix E, we show that a constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which in this case require that the following three properties hold
$$a_n \geqslant 0 \tag{7.14}$$
$$t_n y(\mathbf{x}_n) - 1 \geqslant 0 \tag{7.15}$$
$$a_n \bigl\{ t_n y(\mathbf{x}_n) - 1 \bigr\} = 0. \tag{7.16}$$
Thus for every data point, either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1$. Any data point for which $a_n = 0$ will not appear in the sum in (7.13) and hence plays no role in making predictions for new data points. The remaining data points are called support vectors, and because they satisfy $t_n y(\mathbf{x}_n) = 1$, they correspond to points that lie on the maximum margin hyperplanes in feature space, as illustrated in Figure 7.1. This property is central to the practical applicability of support vector machines. Once the model is trained, a significant proportion of the data points can be discarded and only the support vectors retained.
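In code, the practical consequence of the complementarity condition (7.16) is that the support vectors can be read off from the multipliers. A brief illustrative helper is sketched below (an addition; in floating point the multipliers are never exactly zero, so the tolerance is an assumption rather than anything prescribed by the text).

```python
# Minimal sketch: pick out the support vectors from a solved dual problem.
import numpy as np

def support_vector_indices(a, tol=1e-6):
    """Indices n with a_n > 0; all other points drop out of the sum (7.13)."""
    return np.flatnonzero(a > tol)
```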
Having solved the quadratic programming problem and found a value for $\mathbf{a}$, we can then determine the value of the threshold parameter $b$ by noting that any support vector $\mathbf{x}_n$ satisfies $t_n y(\mathbf{x}_n) = 1$. Using (7.13) this gives
$$t_n \left( \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) + b \right) = 1 \tag{7.17}$$
where $\mathcal{S}$ denotes the set of indices of the support vectors.
Although we can solve this equation for $b$ using an arbitrarily chosen support vector $\mathbf{x}_n$, a numerically more stable solution is obtained by first multiplying through by $t_n$, making use of $t_n^2 = 1$, and then averaging these equations over all support vectors and solving for $b$ to give
$$b = \frac{1}{N_{\mathcal{S}}} \sum_{n \in \mathcal{S}} \left( t_n - \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) \right) \tag{7.18}$$
where $N_{\mathcal{S}}$ is the total number of support vectors.

For later comparison with alternative models, we can express the maximum-margin classifier in terms of the minimization of an error function, with a simple quadratic regularizer, in the form
$$\sum_{n=1}^{N} E_{\infty}\bigl(y(\mathbf{x}_n) t_n - 1\bigr) + \lambda \|\mathbf{w}\|^2 \tag{7.19}$$
where $E_{\infty}(z)$ is a function that is zero if $z \geqslant 0$ and $\infty$ otherwise and ensures that the constraints (7.5) are satisfied. Note that as long as the regularization parameter satisfies $\lambda > 0$, its precise value plays no role.
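Equations (7.18) and (7.13) translate directly into code. The following sketch (an addition, building on the hypothetical helpers introduced above) computes the bias by averaging over the support vectors and then evaluates $y(\mathbf{x})$ for new points.

```python
# Minimal sketch: bias via (7.18) and prediction via (7.13),
# reusing the hypothetical helpers gaussian_kernel / support_vector_indices.
import numpy as np

def compute_bias(a, t, K, sv):
    """b = (1/N_S) sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )."""
    return np.mean(t[sv] - K[np.ix_(sv, sv)] @ (a[sv] * t[sv]))

def predict(X_new, X_sv, a_sv, t_sv, b, kernel):
    """y(x) = sum_{n in S} a_n t_n k(x, x_n) + b; classify by its sign."""
    K_new = kernel(X_new, X_sv)                # shape (num_new, num_sv)
    return K_new @ (a_sv * t_sv) + b
```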
[Figure 7.2: Example of synthetic data from two classes in two dimensions showing contours of constant $y(\mathbf{x})$ obtained from a support vector machine having a Gaussian kernel function. Also shown are the decision boundary, the margin boundaries, and the support vectors.]

Figure 7.2 shows an example of the classification resulting from training a support vector machine on a simple synthetic data set using a Gaussian kernel of the form (6.23). Although the data set is not linearly separable in the two-dimensional data space $\mathbf{x}$, it is linearly separable in the nonlinear feature space defined implicitly by the nonlinear kernel function. Thus the training data points are perfectly separated in the original data space.
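As a rough analogue of the experiment behind Figure 7.2 (this example is an addition, not the book's data set), the sketched dual solver can be trained on an XOR-style synthetic problem with a Gaussian kernel; under these assumptions one would expect the training points to be perfectly separated, with only a subset of them ending up as support vectors.

```python
# Minimal end-to-end sketch using the hypothetical helpers defined above.
import numpy as np

rng = np.random.default_rng(1)

# XOR-style data: four Gaussian blobs; diagonally opposite blobs share a class.
centres = np.array([[-2, -2], [2, 2], [-2, 2], [2, -2]], dtype=float)
X = np.vstack([c + 0.4 * rng.standard_normal((15, 2)) for c in centres])
t = np.array([+1] * 30 + [-1] * 30)

a = fit_dual_svm(X, t)                 # solve the dual (7.10)-(7.12)
sv = support_vector_indices(a)         # points with a_n > 0
K = gaussian_kernel(X, X)
b = compute_bias(a, t, K, sv)          # bias from (7.18)

y = predict(X, X[sv], a[sv], t[sv], b, gaussian_kernel)
print("training accuracy:", np.mean(np.sign(y) == t))
print("number of support vectors:", len(sv))
```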
This example also provides a geometrical insight into the origin of sparsity in the SVM. The maximum margin hyperplane is defined by the location of the support vectors. Other data points can be moved around freely (so long as they remain outside the margin region) without changing the decision boundary, and so the solution will be independent of such data points.

7.1.1 Overlapping class distributions

So far, we have assumed that the training data points are linearly separable in the feature space $\boldsymbol{\phi}(\mathbf{x})$.
The resulting support vector machine will give exact separation of the training data in the original input space $\mathbf{x}$, although the corresponding decision boundary will be nonlinear. In practice, however, the class-conditional distributions may overlap, in which case exact separation of the training data can lead to poor generalization.

We therefore need a way to modify the support vector machine so as to allow some of the training points to be misclassified. From (7.19) we see that in the case of separable classes, we implicitly used an error function that gave infinite error if a data point was misclassified and zero error if it was classified correctly, and then optimized the model parameters to maximize the margin.