In regression, these "low error" points are the ones with small residuals. It is interesting to contrast this with error measures used in robust regression in statistics. The most popular, due to Huber (1964), has the form
\[
V_H(r) =
\begin{cases}
r^2/2, & \text{if } |r| \le c,\\
c|r| - c^2/2, & |r| > c,
\end{cases}
\tag{12.38}
\]
shown in the right panel of Figure 12.8. This function reduces from quadratic to linear the contributions of observations with absolute residual greater than a prechosen constant $c$. This makes the fitting less sensitive to outliers.
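As a quick illustration of the two error measures, here is a minimal NumPy sketch (not from the book) of $V_H$ in (12.38) alongside the $\epsilon$-insensitive loss $V_\epsilon$ of (12.37), taken here in its usual form $V_\epsilon(r) = \max(0, |r| - \epsilon)$; the constants $c$ and $\epsilon$ below are illustrative choices.

```python
import numpy as np

def huber_loss(r, c=1.345):
    """Huber's robust loss V_H(r) from (12.38): quadratic for |r| <= c, linear beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r**2 / 2.0, c * np.abs(r) - c**2 / 2.0)

def eps_insensitive_loss(r, eps=0.5):
    """epsilon-insensitive loss: zero inside the epsilon-tube, linear beyond (cf. (12.37))."""
    r = np.asarray(r, dtype=float)
    return np.maximum(0.0, np.abs(r) - eps)

r = np.linspace(-3, 3, 7)
print(huber_loss(r))             # grows only linearly once |r| exceeds c
print(eps_insensitive_loss(r))   # exactly zero for residuals inside the tube
```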
The support vector error measure (12.37) also has linear tails (beyond $\epsilon$), but in addition it flattens the contributions of those cases with small residuals.

If $\hat\beta$, $\hat\beta_0$ are the minimizers of $H$, the solution function can be shown to have the form
\[
\hat\beta = \sum_{i=1}^N (\hat\alpha_i^* - \hat\alpha_i)\,x_i, \tag{12.39}
\]
\[
\hat f(x) = \sum_{i=1}^N (\hat\alpha_i^* - \hat\alpha_i)\,\langle x, x_i\rangle + \beta_0, \tag{12.40}
\]
where $\hat\alpha_i$, $\hat\alpha_i^*$ are positive and solve the quadratic programming problem
\[
\min_{\alpha_i,\,\alpha_i^*}\;\;
\epsilon \sum_{i=1}^N (\alpha_i^* + \alpha_i)
\;-\; \sum_{i=1}^N y_i(\alpha_i^* - \alpha_i)
\;+\; \frac{1}{2} \sum_{i,i'=1}^N (\alpha_i^* - \alpha_i)(\alpha_{i'}^* - \alpha_{i'})\,\langle x_i, x_{i'}\rangle
\]
subject to the constraints
\[
0 \le \alpha_i,\,\alpha_i^* \le 1/\lambda,\qquad
\sum_{i=1}^N (\alpha_i^* - \alpha_i) = 0,\qquad
\alpha_i \alpha_i^* = 0. \tag{12.41}
\]
Due to the nature of these constraints, typically only a subset of the solution values $(\hat\alpha_i^* - \hat\alpha_i)$ are nonzero, and the associated data values are called the support vectors. As was the case in the classification setting, the solution depends on the input values only through the inner products $\langle x_i, x_{i'}\rangle$.
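The quadratic program (12.41) is what off-the-shelf SVM regression solvers optimize. As a rough illustration (not the book's code), the sketch below fits an $\epsilon$-insensitive support vector regression with a plain inner product using scikit-learn, whose cost parameter C plays the role of $1/\lambda$ up to the convention used here; the data are synthetic.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=100)   # noisy linear target

# epsilon-insensitive SVR with the linear inner product <x, x'>;
# C corresponds (up to convention) to 1/lambda in (12.41).
svr = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)

# Only points with nonzero (alpha*_i - alpha_i) become support vectors,
# i.e. those on or outside the epsilon-tube around the fit.
print("number of support vectors:", len(svr.support_))
print("dual coefficients (alpha*_i - alpha_i, up to sign):", svr.dual_coef_[0][:5])
print("beta-hat:", svr.coef_, "beta0-hat:", svr.intercept_)
```

As in (12.39)-(12.40), the fitted function is a weighted sum over the support vectors only, and it touches the data only through inner products.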
Thus we can generalize the methods to richer spaces by defining an appropriate inner product, for example, one of those defined in (12.22).

Note that there are parameters, $\epsilon$ and $\lambda$, associated with the criterion (12.36). These seem to play different roles. $\epsilon$ is a parameter of the loss function $V_\epsilon$, just like $c$ is for $V_H$. Note that both $V_\epsilon$ and $V_H$ depend on the scale of $y$ and hence $r$. If we scale our response (and hence use $V_H(r/\sigma)$ and $V_\epsilon(r/\sigma)$ instead), then we might consider using preset values for $c$ and $\epsilon$ (the value $c = 1.345$ achieves 95% efficiency for the Gaussian). The quantity $\lambda$ is a more traditional regularization parameter, and can be estimated for example by cross-validation.

12.3.7 Regression and Kernels

As discussed in Section 12.3.3, this kernel property is not unique to support vector machines. Suppose we consider approximation of the regression function in terms of a set of basis functions $\{h_m(x)\}$, $m = 1, 2, \ldots, M$:
\[
f(x) = \sum_{m=1}^{M} \beta_m h_m(x) + \beta_0. \tag{12.42}
\]
To estimate $\beta$ and $\beta_0$ we minimize
\[
H(\beta, \beta_0) = \sum_{i=1}^N V\bigl(y_i - f(x_i)\bigr) + \frac{\lambda}{2} \sum \beta_m^2 \tag{12.43}
\]
for some general error measure $V(r)$. For any choice of $V(r)$, the solution $\hat f(x) = \sum \hat\beta_m h_m(x) + \hat\beta_0$ has the form
\[
\hat f(x) = \sum_{i=1}^N \hat a_i K(x, x_i) \tag{12.44}
\]
with $K(x, y) = \sum_{m=1}^M h_m(x) h_m(y)$. Notice that this has the same form as both the radial basis function expansion and a regularization estimate, discussed in Chapters 5 and 6.

For concreteness, let's work out the case $V(r) = r^2$. Let $\mathbf{H}$ be the $N \times M$ basis matrix with $im$th element $h_m(x_i)$, and suppose that $M > N$ is large. For simplicity we assume that $\beta_0 = 0$, or that the constant is absorbed in $h$; see Exercise 12.3 for an alternative.

We estimate $\beta$ by minimizing the penalized least squares criterion
\[
H(\beta) = (\mathbf{y} - \mathbf{H}\beta)^T(\mathbf{y} - \mathbf{H}\beta) + \lambda\|\beta\|^2. \tag{12.45}
\]
The solution is
\[
\hat{\mathbf{y}} = \mathbf{H}\hat\beta \tag{12.46}
\]
with $\hat\beta$ determined by
\[
-\mathbf{H}^T(\mathbf{y} - \mathbf{H}\hat\beta) + \lambda\hat\beta = 0. \tag{12.47}
\]
From this it appears that we need to evaluate the $M \times M$ matrix of inner products in the transformed space.
However, we can premultiply by $\mathbf{H}$ to give
\[
\mathbf{H}\hat\beta = (\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}\mathbf{H}\mathbf{H}^T \mathbf{y}. \tag{12.48}
\]
The $N \times N$ matrix $\mathbf{H}\mathbf{H}^T$ consists of inner products between pairs of observations $i, i'$; that is, the evaluation of an inner product kernel $\{\mathbf{H}\mathbf{H}^T\}_{i,i'} = K(x_i, x_{i'})$. It is easy to show (12.44) directly in this case, that the predicted values at an arbitrary $x$ satisfy
\[
\hat f(x) = h(x)^T\hat\beta = \sum_{i=1}^N \hat\alpha_i K(x, x_i), \tag{12.49}
\]
where $\hat\alpha = (\mathbf{H}\mathbf{H}^T + \lambda\mathbf{I})^{-1}\mathbf{y}$. As in the support vector machine, we need not specify or evaluate the large set of functions $h_1(x), h_2(x), \ldots, h_M(x)$.
Only the inner product kernel $K(x_i, x_{i'})$ need be evaluated, at the $N$ training points for each $i, i'$ and at points $x$ for predictions there. Careful choice of $h_m$ (such as the eigenfunctions of particular, easy-to-evaluate kernels $K$) means, for example, that $\mathbf{H}\mathbf{H}^T$ can be computed at a cost of $N^2/2$ evaluations of $K$, rather than the direct cost $N^2 M$.

Note, however, that this property depends on the choice of squared norm $\|\beta\|^2$ in the penalty. It does not hold, for example, for the $L_1$ norm $|\beta|$, which may lead to a superior model.
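To make the equivalence concrete, here is a small NumPy sketch (not from the book) that fits the penalized least squares criterion (12.45) directly in the basis space and then again via the kernel form (12.49), using a hypothetical random-cosine basis; the two sets of predictions agree up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 50, 200, 0.1                      # M > N, as in the text
X = rng.uniform(-1, 1, size=(N, 3))

# A hypothetical basis h_1, ..., h_M: random cosine features, purely for illustration.
W = rng.normal(size=(3, M))
b = rng.uniform(0, 2 * np.pi, size=M)
def basis(X):
    return np.cos(X @ W + b)                  # N x M basis matrix H

H = basis(X)
y = rng.normal(size=N)

# Primal ridge solution of (12.45): beta-hat = (H^T H + lam I_M)^{-1} H^T y  (M x M solve).
beta_hat = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ y)

# Kernel solution of (12.49): alpha-hat = (H H^T + lam I_N)^{-1} y  (only N x N).
K = H @ H.T                                   # K(x_i, x_i') = sum_m h_m(x_i) h_m(x_i')
alpha_hat = np.linalg.solve(K + lam * np.eye(N), y)

# Predictions at new points agree: h(x)^T beta-hat == sum_i alpha-hat_i K(x, x_i).
Xnew = rng.uniform(-1, 1, size=(5, 3))
Hnew = basis(Xnew)
print(np.allclose(Hnew @ beta_hat, (Hnew @ H.T) @ alpha_hat))   # True
```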
12.3.8 Discussion

The support vector machine can be extended to multiclass problems, essentially by solving many two-class problems. A classifier is built for each pair of classes, and the final classifier is the one that dominates the most (Kressel, 1999; Friedman, 1996; Hastie and Tibshirani, 1998). Alternatively, one could use the multinomial loss function along with a suitable kernel, as in Section 12.3.3.
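As a rough illustration of the pairwise ("one-versus-one") strategy described above, the following sketch uses scikit-learn's SVC, which builds a binary SVM for each pair of classes and lets the pairs vote; the three-class data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Three-class toy data.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# One binary SVM per pair of classes (3 pairs here); the predicted class
# is the one that wins the most pairwise contests.
clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo").fit(X, y)

print(clf.predict(X[:5]))
print(clf.decision_function(X[:5]).shape)   # (5, 3): one column per class pair
```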
SVMs have applications in many other supervised and unsupervised learning problems. At the time of this writing, empirical evidence suggests that they perform well in many real learning problems.

Finally, we mention the connection of the support vector machine and structural risk minimization (7.9). Suppose the training points (or their basis expansion) are contained in a sphere of radius $R$, and let $G(x) = \mathrm{sign}[f(x)] = \mathrm{sign}[\beta^T x + \beta_0]$ as in (12.2). Then one can show that the class of functions $\{G(x),\ \|\beta\| \le A\}$ has VC-dimension $h$ satisfying
\[
h \le R^2 A^2. \tag{12.50}
\]
If $f(x)$ separates the training data, optimally for $\|\beta\| \le A$, then with probability at least $1 - \eta$ over training sets (Vapnik, 1996, page 139):
\[
\mathrm{Error}_{\mathrm{Test}} \le 4\,\frac{h[\log(2N/h) + 1] - \log(\eta/4)}{N}. \tag{12.51}
\]
The support vector classifier was one of the first practical learning procedures for which useful bounds on the VC dimension could be obtained, and hence the SRM program could be carried out.
However in the derivation, balls are put around the data points, a process that depends on the observed values of the features. Hence in a strict sense, the VC complexity of the class is not fixed a priori, before seeing the features.

The regularization parameter $C$ controls an upper bound on the VC dimension of the classifier.
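As a small numeric illustration of how (12.50) and (12.51) could be evaluated, the following sketch computes the two bounds for hypothetical values of the radius $R$, norm bound $A$, sample size $N$, and confidence level $\eta$; none of these numbers come from the text.

```python
import numpy as np

def vc_dimension_bound(R, A):
    """Upper bound on the VC dimension h from (12.50): h <= R^2 A^2."""
    return R**2 * A**2

def test_error_bound(h, N, eta):
    """Upper bound on the test error from (12.51), valid with probability >= 1 - eta."""
    return 4.0 * (h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N

# Hypothetical values, purely for illustration.
R, A, N, eta = 1.0, 10.0, 10_000, 0.05
h = vc_dimension_bound(R, A)            # 100.0
print(h, test_error_bound(h, N, eta))   # bound on test error, roughly 0.25 here
```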
Following the SRM paradigm, we could choose $C$ by minimizing the upper bound on the test error, given in (12.51). However, it is not clear that this has any advantage over the use of cross-validation for choice of $C$.

12.4 Generalizing Linear Discriminant Analysis

In Section 4.3 we discussed linear discriminant analysis (LDA), a fundamental tool for classification. For the remainder of this chapter we discuss a class of techniques that produce better classifiers than LDA by directly generalizing LDA.

Some of the virtues of LDA are as follows:

• It is a simple prototype classifier.
A new observation is classified to the class with closest centroid. A slight twist is that distance is measured in the Mahalanobis metric, using a pooled covariance estimate (a small sketch of this rule appears after these lists).

• LDA is the estimated Bayes classifier if the observations are multivariate Gaussian in each class, with a common covariance matrix. Since this assumption is unlikely to be true, this might not seem to be much of a virtue.

• The decision boundaries created by LDA are linear, leading to decision rules that are simple to describe and implement.

• LDA provides natural low-dimensional views of the data. For example, Figure 12.12 is an informative two-dimensional view of data in 256 dimensions with ten classes.

• Often LDA produces the best classification results, because of its simplicity and low variance.
LDA was among the top three classifiers for 7 of the 22 datasets studied in the STATLOG project (Michie et al., 1994).

Unfortunately the simplicity of LDA causes it to fail in a number of situations as well:

• Often linear decision boundaries do not adequately separate the classes. When $N$ is large, it is possible to estimate more complex decision boundaries. Quadratic discriminant analysis (QDA) is often useful here, and allows for quadratic decision boundaries.
More generally we would like to be able to model irregular decision boundaries.

• The aforementioned shortcoming of LDA can often be paraphrased by saying that a single prototype per class is insufficient. LDA uses a single prototype (class centroid) plus a common covariance matrix to describe the spread of the data in each class. In many situations, several prototypes are more appropriate.

• At the other end of the spectrum, we may have way too many (correlated) predictors, for example, in the case of digitized analogue signals and images.
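As promised in the first item of the virtues list, here is a minimal NumPy sketch (not the book's code) of LDA viewed as a prototype classifier: each class is represented by its centroid, a pooled within-class covariance estimate defines the Mahalanobis metric, and a new observation is assigned to the nearest centroid in that metric. Equal class priors are assumed for simplicity; with unequal priors LDA adds a log-prior term.

```python
import numpy as np

def fit_lda_prototypes(X, y):
    """Class centroids and pooled within-class covariance, as used by LDA."""
    classes = np.unique(y)
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    N, p = X.shape
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, centroids):
        Xk = X[y == k] - mu
        pooled += Xk.T @ Xk
    pooled /= (N - len(classes))                     # pooled covariance estimate
    return classes, centroids, np.linalg.inv(pooled)

def predict_nearest_centroid(X, classes, centroids, prec):
    """Assign each row of X to the class with closest centroid in the Mahalanobis
    metric (equivalent to LDA when the class priors are equal)."""
    d = X[:, None, :] - centroids[None, :, :]        # (n, K, p) differences
    mahal = np.einsum("nkp,pq,nkq->nk", d, prec, d)  # squared Mahalanobis distances
    return classes[np.argmin(mahal, axis=1)]

# Tiny synthetic example with two classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 1], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
classes, centroids, prec = fit_lda_prototypes(X, y)
print(predict_nearest_centroid(X[:5], classes, centroids, prec))
```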