The Elements of Statistical Learning: Data Mining, Inference, and Prediction
FIGURE 12.5. The logistic regression versions of the SVM models in Figure 12.3, using the identical kernels and hence penalties, but the log-likelihood loss instead of the SVM loss function. The two broken contours correspond to posterior probabilities of 0.75 and 0.25 for the +1 class (or vice versa). The broken purple curve in the background is the Bayes decision boundary.

TABLE 12.2.
Skin of the orange: Shown are mean (standard error of the mean) of the test error over 50 simulations. BRUTO fits an additive spline model adaptively, while MARS fits a low-order interaction model adaptively.

                             Test Error (SE)
       Method           No Noise Features   Six Noise Features
   1   SV Classifier      0.450 (0.003)       0.472 (0.003)
   2   SVM/poly 2         0.078 (0.003)       0.152 (0.004)
   3   SVM/poly 5         0.180 (0.004)       0.370 (0.004)
   4   SVM/poly 10        0.230 (0.003)       0.434 (0.002)
   5   BRUTO              0.084 (0.003)       0.090 (0.003)
   6   MARS               0.156 (0.004)       0.173 (0.005)
       Bayes              0.029               0.029

12.3.4 SVMs and the Curse of Dimensionality

In this section, we address the question of whether SVMs have some edge on the curse of dimensionality.
Notice that in expression (12.23) we are not allowed a fully general inner product in the space of powers and products. For example, all terms of the form √2 Xj Xj′ are given equal weight, and the kernel cannot adapt itself to concentrate on subspaces. If the number of features p were large, but the class separation occurred only in the linear subspace spanned by say X1 and X2, this kernel would not easily find the structure and would suffer from having many dimensions to search over. One would have to build knowledge about the subspace into the kernel; that is, tell it to ignore all but the first two inputs. If such knowledge were available a priori, much of statistical learning would be made much easier. A major goal of adaptive methods is to discover such structure. We support these statements with an illustrative example.
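The fixed weighting of the cross-product terms can be checked numerically: for a degree-2 polynomial kernel in p = 2 inputs, K(x, y) = (1 + ⟨x, y⟩)² equals the ordinary inner product of an explicit six-dimensional feature map with hard-wired √2 coefficients. A minimal sketch (the test values and function name are ours):

```python
import numpy as np

def poly2_features(x):
    """Explicit degree-2 feature map h(x) with K(x, y) = (1 + <x, y>)**2
    = <h(x), h(y)> in p = 2 dimensions.  Note the fixed sqrt(2) weights
    on the linear and cross-product terms -- the kernel cannot
    reweight them to favor a subspace."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])

k_direct = (1.0 + x @ y) ** 2                        # kernel evaluation
k_expanded = poly2_features(x) @ poly2_features(y)   # inner product in feature space

assert np.isclose(k_direct, k_expanded)
```

The two evaluations agree exactly; the kernel trick only hides the expansion, it does not change the (equal) weights attached to each monomial.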
We generated 100 observations in each of two classes. The first class has four standard normal independent features X1, X2, X3, X4. The second class also has four standard normal independent features, but conditioned on 9 ≤ Σj Xj² ≤ 16. Hence the second class almost completely surrounds the first, like the skin surrounding the orange, in a four-dimensional subspace. This is a relatively easy problem. As a second, harder problem, we augmented the features with an additional six standard Gaussian noise features. The Bayes error rate for this problem is 0.029 (irrespective of dimension). We generated 1000 test observations to compare different procedures.
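The data-generating process above can be sketched as follows; rejection sampling is one natural way to impose the conditioning on the squared radius (the seed and function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skin(n, p=4, lo=9.0, hi=16.0):
    """Rejection sampler: draw standard normal vectors in p dimensions
    and keep only those with lo <= sum(x_j^2) <= hi -- the 'skin'."""
    out = []
    while len(out) < n:
        x = rng.standard_normal(p)
        if lo <= np.sum(x ** 2) <= hi:
            out.append(x)
    return np.array(out)

n = 100
X1 = rng.standard_normal((n, 4))   # class +1: the 'orange'
X2 = sample_skin(n)                # class -1: the surrounding 'skin'
X = np.vstack([X1, X2])
y = np.r_[np.ones(n), -np.ones(n)]

# Harder variant: append six pure-noise features to every observation.
X_noise = np.hstack([X, rng.standard_normal((2 * n, 6))])
```

Since a chi-squared variate on 4 degrees of freedom rarely falls in [9, 16], the rejection loop discards most draws, but it reproduces the conditional distribution exactly.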
The average test errors over 50 simulations, with and without noise features, are shown in Table 12.2.

Line 1 uses the support vector classifier in the original feature space. Lines 2–4 refer to the support vector machine with a 2-, 5- and 10-dimensional polynomial kernel. For all support vector procedures, we chose the cost parameter C to minimize the test error, to be as fair as possible to the method.

FIGURE 12.6. Test-error curves as a function of the cost parameter C for the radial-kernel SVM classifier on the mixture data. At the top of each plot is the scale parameter γ for the radial kernel: Kγ(x, y) = exp(−γ‖x − y‖²). The optimal value for C depends quite strongly on the scale of the kernel. The Bayes error rate is indicated by the broken horizontal lines.
Line 5 fits an additive spline model to the (−1, +1) response by least squares, using the BRUTO algorithm for additive models, described in Hastie and Tibshirani (1990). Line 6 uses MARS (multivariate adaptive regression splines) allowing interaction of all orders, as described in Chapter 9; as such it is comparable with the SVM/poly 10.
Both BRUTO and MARS have the ability to ignore redundant variables. Test error was not used to choose the smoothing parameters in either of lines 5 or 6.

In the original feature space, a hyperplane cannot separate the classes, and the support vector classifier (line 1) does poorly. The polynomial support vector machine makes a substantial improvement in test error rate, but is adversely affected by the six noise features. It is also very sensitive to the choice of kernel: the second-degree polynomial kernel (line 2) does best, since the true decision boundary is a second-degree polynomial.
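The polynomial-kernel comparison in lines 2–4 can be reproduced in outline with scikit-learn's SVC. This is an illustrative sketch, not the book's experiment: here C, γ, and coef0 are fixed rather than tuned per method as in Table 12.2, and the seed is ours:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def skin_of_orange(n, rng):
    """n observations per class; class 2 is standard normal in 4-d
    conditioned on 9 <= sum(x_j^2) <= 16 (rejection sampling)."""
    X1 = rng.standard_normal((n, 4))
    X2 = []
    while len(X2) < n:
        x = rng.standard_normal(4)
        if 9.0 <= np.sum(x ** 2) <= 16.0:
            X2.append(x)
    X = np.vstack([X1, np.array(X2)])
    y = np.r_[np.ones(n), -np.ones(n)]
    return X, y

Xtr, ytr = skin_of_orange(100, rng)
Xte, yte = skin_of_orange(1000, rng)

errs = {}
for degree in (2, 5, 10):
    # (gamma * <x, x'> + coef0)**degree matches the book's (1 + <x, x'>)**d
    clf = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=1.0, C=1.0)
    clf.fit(Xtr, ytr)
    errs[degree] = np.mean(clf.predict(Xte) != yte)
    print(f"SVM/poly {degree}: test error {errs[degree]:.3f}")
```

The degree-2 kernel should do well here because the true boundary, a sphere in the squared features, is itself a second-degree polynomial; exact error rates will differ from Table 12.2 without the per-method tuning of C.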
However, higher-degree polynomial kernels (lines 3 and 4) do much worse. BRUTO performs well, since the boundary is additive. BRUTO and MARS adapt well: their performance does not deteriorate much in the presence of noise.

12.3.5 A Path Algorithm for the SVM Classifier

The regularization parameter for the SVM classifier is the cost parameter C, or its inverse λ in (12.25). Common usage is to set C high, leading often to somewhat overfit classifiers.

Figure 12.6 shows the test error on the mixture data as a function of C, using different radial-kernel parameters γ. When γ = 5 (narrow peaked kernels), the heaviest regularization (small C) is called for. With γ = 1

FIGURE 12.7.
A simple example illustrates the SVM path algorithm. (left panel:) This plot illustrates the state of the model at λ = 0.5. The "+1" points are orange, the "−1" blue. λ = 1/2, and the width of the soft margin is 2/‖β‖ = 2 × 0.587. Two blue points {3, 5} are misclassified, while the two orange points {10, 12} are correctly classified, but on the wrong side of their margin f(x) = +1; each of these has yi f(xi) < 1. The three square-shaped points {2, 6, 7} are exactly on their margins.
(right panel:) This plot shows the piecewise linear profiles αi(λ). The horizontal broken line at λ = 1/2 indicates the state of the αi for the model in the left plot.

(the value used in Figure 12.3), an intermediate value of C is required. Clearly in situations such as these, we need to determine a good choice for C, perhaps by cross-validation. Here we describe a path algorithm (in the spirit of Section 3.8) for efficiently fitting the entire sequence of SVM models obtained by varying C.

It is convenient to use the loss+penalty formulation (12.25), along with Figure 12.4.
This leads to a solution for β at a given value of λ:

    \beta_\lambda = \frac{1}{\lambda} \sum_{i=1}^{N} \alpha_i y_i x_i.    (12.33)

The αi are again Lagrange multipliers, but in this case they all lie in [0, 1]. Figure 12.7 illustrates the setup. It can be shown that the KKT optimality conditions imply that the labeled points (xi, yi) fall into three distinct groups:

• Observations correctly classified and outside their margins. They have yi f(xi) > 1, and Lagrange multipliers αi = 0.
Examples are the orange points 8, 9 and 11, and the blue points 1 and 4.

• Observations sitting on their margins with yi f(xi) = 1, with Lagrange multipliers αi ∈ [0, 1]. Examples are the orange 7 and the blue 2 and 6.

• Observations inside their margins have yi f(xi) < 1, with αi = 1. Examples are the blue 3 and 5, and the orange 10 and 12.

The idea for the path algorithm is as follows. Initially λ is large, the margin 1/‖βλ‖ is wide, and all points are inside their margin and have αi = 1.
As λ decreases, 1/‖βλ‖ decreases, and the margin gets narrower. Some points will move from inside their margins to outside their margins, and their αi will change from 1 to 0. By continuity of the αi(λ), these points will linger on the margin during this transition. From (12.33) we see that the points with αi = 1 make fixed contributions to β(λ), and those with αi = 0 make no contribution. So all that changes as λ decreases are the αi ∈ [0, 1] of the small number of points on the margin. Since all these points have yi f(xi) = 1, this results in a small set of linear equations that prescribe how αi(λ), and hence βλ, changes during these transitions.
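The representation (12.33) of β as a weighted sum of αi yi xi over the support points can be verified on a fitted linear SVM. In scikit-learn's C-parameterization (which absorbs the 1/λ factor), `dual_coef_` stores the products αi yi for the support vectors; a sketch on synthetic data of our own:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Two Gaussian blobs, mostly separable, so some alpha_i are strictly
# between 0 and 1 and others hit the bounds.
X = np.vstack([rng.standard_normal((50, 2)) + [2.0, 2.0],
               rng.standard_normal((50, 2)) - [2.0, 2.0]])
y = np.r_[np.ones(50), -np.ones(50)]

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# beta = sum_i (alpha_i y_i) x_i over the support vectors,
# mirroring (12.33) with the 1/lambda factor folded into C.
beta = clf.dual_coef_ @ clf.support_vectors_

assert np.allclose(beta, clf.coef_)
```

Points well outside their margins do not appear among `support_vectors_` at all, matching the first KKT group above (αi = 0, no contribution to β).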
This results in piecewise linear paths for each of the αi(λ). The breaks occur when points cross the margin. Figure 12.7 (right panel) shows the αi(λ) profiles for the small example in the left panel.

Although we have described this for linear SVMs, exactly the same idea works for nonlinear models, in which (12.33) is replaced by

    f_\lambda(x) = \frac{1}{\lambda} \sum_{i=1}^{N} \alpha_i y_i K(x, x_i).    (12.34)

Details can be found in Hastie et al. (2004). An R package svmpath is available on CRAN for fitting these models.

12.3.6 Support Vector Machines for Regression

In this section we show how SVMs can be adapted for regression with a quantitative response, in ways that inherit some of the properties of the SVM classifier. We first discuss the linear regression model

    f(x) = x^T \beta + \beta_0,    (12.35)

and then handle nonlinear generalizations.
To estimate β, we consider minimization of

    H(\beta, \beta_0) = \sum_{i=1}^{N} V(y_i - f(x_i)) + \frac{\lambda}{2} \|\beta\|^2,    (12.36)

where

    V_\varepsilon(r) = \begin{cases} 0 & \text{if } |r| < \varepsilon, \\ |r| - \varepsilon & \text{otherwise.} \end{cases}    (12.37)

FIGURE 12.8. The left panel shows the ε-insensitive error function used by the support vector regression machine. The right panel shows the error function used in Huber's robust regression (blue curve). Beyond |c|, the function changes from quadratic to linear.

This is an "ε-insensitive" error measure, ignoring errors of size less than ε (left panel of Figure 12.8). There is a rough analogy with the support vector classification setup, where points on the correct side of the decision boundary and far away from it are ignored in the optimization.
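The two error functions in Figure 12.8 are easy to state in code; a small sketch of V_ε from (12.37) and the Huber loss (the evaluation points and the default constants are ours):

```python
import numpy as np

def v_eps(r, eps=0.5):
    """epsilon-insensitive loss (12.37): zero inside [-eps, eps],
    growing linearly in |r| beyond it."""
    return np.maximum(np.abs(r) - eps, 0.0)

def v_huber(r, c=1.345):
    """Huber's robust loss: quadratic for |r| <= c, linear beyond,
    joined so the function and its derivative are continuous at |r| = c."""
    a = np.abs(r)
    return np.where(a <= c, r ** 2 / 2.0, c * a - c ** 2 / 2.0)

r = np.linspace(-4.0, 4.0, 9)
print(v_eps(r))    # flat zero region around the origin
print(v_huber(r))  # smooth quadratic-to-linear transition
```

Both losses reduce the influence of large residuals relative to squared error, which is what gives support vector regression and Huber regression their robustness to outliers.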