The Elements of Statistical Learning: Data Mining, Inference, and Prediction
In each case C was tuned to approximately achieve the best test error performance, and C = 1 worked well in both cases. The radial basis kernel performs the best (close to Bayes optimal), as might be expected given the data arise from mixtures of Gaussians. The broken purple curve in the background is the Bayes decision boundary.

FIGURE 12.4. The support vector loss function (hinge loss), compared to the negative log-likelihood loss (binomial deviance) for logistic regression, squared-error loss, and a "Huberized" version of the squared hinge loss. All are shown as a function of yf rather than f, because of the symmetry between the y = +1 and y = −1 case. The deviance and Huber have the same asymptotes as the SVM loss, but are rounded in the interior.
All are scaled to have the limiting left-tail slope of −1.

12.3.2 The SVM as a Penalization Method

With f(x) = h(x)^T β + β_0, consider the optimization problem

$$
\min_{\beta_0,\,\beta}\ \sum_{i=1}^N \bigl[1 - y_i f(x_i)\bigr]_+ \;+\; \frac{\lambda}{2}\|\beta\|^2,
\tag{12.25}
$$

where the subscript "+" indicates positive part. This has the form loss + penalty, which is a familiar paradigm in function estimation. It is easy to show (Exercise 12.1) that the solution to (12.25), with λ = 1/C, is the same as that for (12.8).

Examination of the "hinge" loss function L(y, f) = [1 − yf]_+ shows that it is reasonable for two-class classification, when compared to other more traditional loss functions. Figure 12.4 compares it to the log-likelihood loss for logistic regression, as well as squared-error loss and a variant thereof. The (negative) log-likelihood or binomial deviance has similar tails as the SVM loss, giving zero penalty to points well inside their margin, and a linear penalty to points on the wrong side and far away. Squared-error, on the other hand, gives a quadratic penalty, and points well inside their own margin have a strong influence on the model as well. The squared hinge loss L(y, f) = [1 − yf]_+^2 is like the quadratic, except it is zero for points inside their margin. It still rises quadratically in the left tail, and will be less robust than hinge or deviance to misclassified observations. Recently Rosset and Zhu (2007) proposed a "Huberized" version of the squared hinge loss, which converts smoothly to a linear loss at yf = −1.

We can characterize these loss functions in terms of what they are estimating at the population level. We consider minimizing E L(Y, f(X)). Table 12.1 summarizes the results.

TABLE 12.1. The population minimizers for the different loss functions in Figure 12.4. Logistic regression uses the binomial log-likelihood or deviance. Linear discriminant analysis (Exercise 4.2) uses squared-error loss. The SVM hinge loss estimates the mode of the posterior class probabilities, whereas the others estimate a linear transformation of these probabilities.

  Loss Function                    L[y, f(x)]                                   Minimizing Function
  Binomial Deviance                log[1 + e^{−yf(x)}]                          f(x) = log [Pr(Y = +1|x) / Pr(Y = −1|x)]
  SVM Hinge Loss                   [1 − yf(x)]_+                                f(x) = sign[Pr(Y = +1|x) − 1/2]
  Squared Error                    [y − f(x)]^2 = [1 − yf(x)]^2                 f(x) = 2 Pr(Y = +1|x) − 1
  "Huberized" Square Hinge Loss    −4yf(x) if yf(x) < −1;                       f(x) = 2 Pr(Y = +1|x) − 1
                                   [1 − yf(x)]_+^2 otherwise
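To make the comparison in Figure 12.4 and Table 12.1 concrete, here is a minimal sketch (plain NumPy; the function names are ours, not from the book) of the four unscaled losses from the L[y, f(x)] column, evaluated on a grid of margins yf:

```python
import numpy as np

def hinge(yf):
    """SVM hinge loss [1 - yf]_+."""
    return np.maximum(0.0, 1.0 - yf)

def binomial_deviance(yf):
    """Negative binomial log-likelihood log(1 + exp(-yf))."""
    return np.log1p(np.exp(-yf))

def squared_error(yf):
    """Squared-error loss, written on the margin scale as (1 - yf)^2."""
    return (1.0 - yf) ** 2

def huberized_sq_hinge(yf):
    """'Huberized' squared hinge: linear -4*yf for yf < -1, else [1 - yf]_+^2."""
    yf = np.asarray(yf, dtype=float)
    return np.where(yf < -1.0, -4.0 * yf, np.maximum(0.0, 1.0 - yf) ** 2)

# Evaluate each loss on a grid of margins, as in Figure 12.4.
margins = np.linspace(-3, 3, 7)
for name, loss in [("hinge", hinge), ("deviance", binomial_deviance),
                   ("squared error", squared_error), ("huberized", huberized_sq_hinge)]:
    print(name, np.round(loss(margins), 2))
```

Note that the Huberized loss and its derivative both equal 4 at yf = −1, which is the "converts smoothly to a linear loss" behavior described above.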
Whereas the hinge loss estimates the classifier G(x) itself, all the others estimate a transformation of the class posterior probabilities. The "Huberized" square hinge loss shares attractive properties of logistic regression (smooth loss function, estimates probabilities), as well as the SVM hinge loss (support points).

Formulation (12.25) casts the SVM as a regularized function estimation problem, where the coefficients of the linear expansion f(x) = β_0 + h(x)^T β are shrunk toward zero (excluding the constant). If h(x) represents a hierarchical basis having some ordered structure (such as ordered in roughness),
then the uniform shrinkage makes more sense if the rougher elements h_j in the vector h have smaller norm.

All the loss-functions in Table 12.1 except squared-error are so-called "margin maximizing loss-functions" (Rosset et al., 2004b). This means that if the data are separable, then the limit of β̂_λ in (12.25) as λ → 0 defines the optimal separating hyperplane.¹

¹ For logistic regression with separable data, β̂_λ diverges, but β̂_λ/‖β̂_λ‖ converges to the optimal separating direction.
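The penalization view (12.25) and this margin-maximizing limit can be checked numerically. Below is a minimal sketch (plain NumPy; the toy data, the function name fit_svm_penalized, and the step-size schedule are our own choices, not from the book) that minimizes criterion (12.25) for a linear f(x) = β_0 + x^T β by subgradient descent, and prints the normalized coefficient direction as λ shrinks:

```python
import numpy as np

def fit_svm_penalized(X, y, lam, n_steps=20000, lr=0.01):
    """Minimize sum_i [1 - y_i (b0 + x_i' beta)]_+ + (lam/2) * ||beta||^2
    by (sub)gradient descent; labels y are coded as -1/+1."""
    n, p = X.shape
    beta, b0 = np.zeros(p), 0.0
    for t in range(n_steps):
        margins = y * (X @ beta + b0)
        active = margins < 1                      # points violating the margin
        g_beta = -(y[active][:, None] * X[active]).sum(axis=0) + lam * beta
        g_b0 = -y[active].sum()
        step = lr / (1 + 0.001 * t)               # slowly decaying step size
        beta -= step * g_beta
        b0 -= step * g_b0
    return b0, beta

# A tiny separable toy problem (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]

# As lambda -> 0 on separable data, the normalized coefficient vector should
# stabilize at the optimal separating direction (the margin-maximizing property).
for lam in [1.0, 0.1, 0.01]:
    b0, beta = fit_svm_penalized(X, y, lam)
    print(lam, np.round(beta / np.linalg.norm(beta), 3))
```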
12.3.3 Function Estimation and Reproducing Kernels

Here we describe SVMs in terms of function estimation in reproducing kernel Hilbert spaces, where the kernel property abounds. This material is discussed in some detail in Section 5.8. This provides another view of the support vector classifier, and helps to clarify how it works.

Suppose the basis h arises from the (possibly finite) eigen-expansion of a positive definite kernel K,

$$
K(x, x') = \sum_{m=1}^{\infty} \phi_m(x)\,\phi_m(x')\,\delta_m,
\tag{12.26}
$$

and h_m(x) = √δ_m φ_m(x). Then with θ_m = √δ_m β_m, we can write (12.25) as

$$
\min_{\beta_0,\,\theta}\ \sum_{i=1}^N \Bigl[1 - y_i\Bigl(\beta_0 + \sum_{m=1}^{\infty}\theta_m\phi_m(x_i)\Bigr)\Bigr]_+ \;+\; \frac{\lambda}{2}\sum_{m=1}^{\infty}\frac{\theta_m^2}{\delta_m}.
\tag{12.27}
$$

Now (12.27) is identical in form to (5.49) on page 169 in Section 5.8, and the theory of reproducing kernel Hilbert spaces described there guarantees a finite-dimensional solution of the form

$$
f(x) = \beta_0 + \sum_{i=1}^N \alpha_i K(x, x_i).
\tag{12.28}
$$

In particular we see there an equivalent version of the optimization criterion (12.19) [Equation (5.67) in Section 5.8.2; see also Wahba et al.
(2000)],

$$
\min_{\beta_0,\,\alpha}\ \sum_{i=1}^N \bigl(1 - y_i f(x_i)\bigr)_+ \;+\; \frac{\lambda}{2}\,\alpha^T \mathbf{K}\alpha,
\tag{12.29}
$$

where K is the N × N matrix of kernel evaluations for all pairs of training features (Exercise 12.2).
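As a small illustration of the bookkeeping behind (12.28) and (12.29), the sketch below (plain NumPy; the radial-basis kernel choice and the helper names rbf_kernel and svm_criterion are our own assumptions, not code from the book) builds the N × N kernel matrix K and evaluates the penalized hinge-loss criterion for a given (β_0, α):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Radial basis kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svm_criterion(beta0, alpha, K, y, lam):
    """Criterion (12.29): sum_i (1 - y_i f(x_i))_+ + (lam/2) * alpha' K alpha,
    with f(x_i) = beta0 + sum_j alpha_j K(x_i, x_j) as in (12.28)."""
    f = beta0 + K @ alpha
    hinge = np.maximum(0.0, 1.0 - y * f).sum()
    penalty = 0.5 * lam * alpha @ K @ alpha
    return hinge + penalty

# Toy data and an arbitrary candidate (beta0, alpha), purely to show the pieces.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
y = np.sign(rng.normal(size=10))
K = rbf_kernel(X, X)                       # N x N matrix of kernel evaluations
alpha = rng.normal(scale=0.1, size=10)
print(svm_criterion(0.0, alpha, K, y, lam=0.5))
```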
These models are quite general, and include, for example, the entire family of smoothing splines, additive and interaction spline models discussed in Chapters 5 and 9, and in more detail in Wahba (1990) and Hastie and Tibshirani (1990). They can be expressed more generally as

$$
\min_{f \in \mathcal{H}}\ \sum_{i=1}^N \bigl[1 - y_i f(x_i)\bigr]_+ + \lambda J(f),
\tag{12.30}
$$

where H is the structured space of functions, and J(f) an appropriate regularizer on that space. For example, suppose H is the space of additive functions f(x) = Σ_{j=1}^p f_j(x_j), and J(f) = Σ_j ∫ {f_j''(x_j)}² dx_j. Then the solution to (12.30) is an additive cubic spline, and has a kernel representation (12.28) with K(x, x') = Σ_{j=1}^p K_j(x_j, x_j'). Each of the K_j is the kernel appropriate for the univariate smoothing spline in x_j (Wahba, 1990).

Conversely this discussion also shows that, for example, any of the kernels described in (12.22) above can be used with any convex loss function, and will also lead to a finite-dimensional representation of the form (12.28). Figure 12.5 uses the same kernel functions as in Figure 12.3, except using the binomial log-likelihood as a loss function.²
The fitted function is hence an estimate of the log-odds,

$$
\hat{f}(x) = \log\frac{\hat{\Pr}(Y = +1\,|\,x)}{\hat{\Pr}(Y = -1\,|\,x)} = \hat{\beta}_0 + \sum_{i=1}^N \hat{\alpha}_i K(x, x_i),
\tag{12.31}
$$

or conversely we get an estimate of the class probabilities

$$
\hat{\Pr}(Y = +1\,|\,x) = \frac{1}{1 + e^{-\hat{\beta}_0 - \sum_{i=1}^N \hat{\alpha}_i K(x, x_i)}}.
\tag{12.32}
$$

The fitted models are quite similar in shape and performance. Examples and more details are given in Section 5.8.
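Given fitted values of the log-odds (12.31), obtaining class probabilities via (12.32) is just a logistic transform. A minimal, self-contained sketch (the numeric values of f̂ below are made up for illustration; how β̂_0 and α̂ are fitted is not shown here):

```python
import numpy as np

def class_probability(f_hat):
    """Map fitted log-odds (12.31) to Pr(Y = +1 | x) via the logistic
    transform (12.32): 1 / (1 + exp(-f_hat))."""
    return 1.0 / (1.0 + np.exp(-f_hat))

# f_hat(x) = beta0_hat + sum_i alpha_hat_i * K(x, x_i), evaluated at some
# query points; here we just plug in hypothetical log-odds values.
f_hat = np.array([-2.0, 0.0, 3.0])
print(np.round(class_probability(f_hat), 3))   # -> [0.119 0.5   0.953]
```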
It does happen that for SVMs, a sizable fraction of the N values of α_i can be zero (the nonsupport points). In the two examples in Figure 12.3, these fractions are 42% and 45%, respectively. This is a consequence of the piecewise linear nature of the first part of the criterion (12.25). The lower the class overlap (on the training data), the greater this fraction will be. Reducing λ will generally reduce the overlap (allowing a more flexible f). A small number of support points means that f̂(x) can be evaluated more quickly, which is important at lookup time. Of course, reducing the overlap too much can lead to poor generalization.

² Ji Zhu assisted in the preparation of these examples.
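The fraction of nonsupport points is easy to read off a fitted SVM. A minimal sketch, assuming scikit-learn is available (the toy data below are ours; the 42% and 45% figures quoted above refer to the book's mixture example, not to this code):

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with some overlap (illustrative only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]

fit = SVC(kernel="rbf", C=1.0).fit(X, y)
# Points not among the support vectors have alpha_i = 0.
frac_nonsupport = 1.0 - len(fit.support_) / len(y)
print(f"fraction of nonsupport points: {frac_nonsupport:.2f}")
```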
[FIGURE 12.5: LR - Degree-4 Polynomial in Feature Space (plot not reproduced).]