Paraphrasing: with 30% more data, the conditional likelihood will do as well.

For example, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix. This is not all good news, because it also means that LDA is not robust to gross outliers.

From the mixture formulation, it is clear that even observations without class labels have information about the parameters. Often it is expensive to generate class labels, but unclassified observations come cheaply.
By relying on strong model assumptions, such as here, we can use both types of information.

The marginal likelihood can be thought of as a regularizer, requiring in some sense that class densities be visible from this marginal view. For example, if the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5). The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies.
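To see the degeneracy concretely, here is a small Python sketch; the toy data, the use of scikit-learn, and the choice of penalty values are all assumptions made for the illustration, not part of the text. On a separable sample the logistic coefficients grow without bound as the ridge penalty is relaxed, while LDA returns a single finite answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical two-class sample, perfectly separable in the first coordinate.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.5, size=(20, 2)),
               rng.normal([+2, 0], 0.5, size=(20, 2))])
y = np.r_[np.zeros(20), np.ones(20)]

# As the ridge penalty is relaxed (C grows), the logistic coefficients grow
# without bound: the unpenalized likelihood has no finite maximizer here.
for C in (1e0, 1e2, 1e4, 1e6):
    lr = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C={C:g}  ||beta_logistic|| = {np.linalg.norm(lr.coef_):.1f}")

# LDA also uses the marginal (mixture) likelihood, and its coefficients
# remain perfectly well defined on the same separable data.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("||beta_LDA|| =", np.linalg.norm(lda.coef_))
```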
In practice these assumptions are never correct, and often some of the components of X are qualitative variables. It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors.

FIGURE 4.14. A toy example with two classes separable by a hyperplane. The orange line is the least squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the perceptron learning algorithm with different random starts.

4.5 Separating Hyperplanes

We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. For the rest of this chapter we describe separating hyperplane classifiers. These procedures construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.
They provide the basis for support vector classifiers, discussed in Chapter 12. The mathematical level of this section is somewhat higher than that of the previous sections.

Figure 4.14 shows 20 data points in two classes in $\mathbb{R}^2$. These data can be separated by a linear boundary.
Included in the figure (blue lines) are two of the infinitely many possible separating hyperplanes. The orange line is the least squares solution to the problem, obtained by regressing the −1/1 response Y on X (with intercept); the line is given by
\[
\{x : \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 = 0\}. \tag{4.39}
\]
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by LDA, in light of its equivalence with linear regression in the two-class case (Section 4.3 and Exercise 4.2).
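The least squares boundary (4.39) is simple to reproduce. The sketch below uses six made-up points as a stand-in for the data of Figure 4.14; the points and variable names are assumptions made for the illustration.

```python
import numpy as np

# Six made-up points standing in for the two-class data of Figure 4.14,
# coded as a -1/1 response.
X = np.array([[-2.0, 0.0], [-1.5, 1.0], [-1.0, -1.0],
              [ 1.0, 1.0], [ 1.5, -1.0], [ 2.0, 0.0]])
y = np.array([-1., -1., -1., 1., 1., 1.])

# Least squares fit of y on (1, x1, x2); the boundary (4.39) is the set
# where the fitted value crosses zero, and classification takes its sign.
Xa = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(Xa, y, rcond=None)

f = lambda x: beta_hat[0] + beta_hat[1:] @ x
print(beta_hat, np.sign(f(np.array([0.5, -0.5]))))
```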
Classifiers such as (4.39), that compute a linear combination of the input features and return the sign, were called perceptrons in the engineering literature in the late 1950s (Rosenblatt, 1958). Perceptrons set the foundations for the neural network models of the 1980s and 1990s.

Before we continue, let us digress slightly and review some vector algebra.

FIGURE 4.15. The linear algebra of a hyperplane (affine set).

Figure 4.15 depicts a hyperplane or affine set $L$ defined by the equation $f(x) = \beta_0 + \beta^T x = 0$; since we are in $\mathbb{R}^2$ this is a line. Here we list some properties:
1. For any two points $x_1$ and $x_2$ lying in $L$, $\beta^T(x_1 - x_2) = 0$, and hence $\beta^* = \beta/\|\beta\|$ is the vector normal to the surface of $L$.

2. For any point $x_0$ in $L$, $\beta^T x_0 = -\beta_0$.

3. The signed distance of any point $x$ to $L$ is given by
\[
\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}\left(\beta^T x + \beta_0\right) = \frac{1}{\|f'(x)\|} f(x). \tag{4.40}
\]

Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x) = 0$.
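As a quick check of property 3, the following sketch evaluates (4.40) for a made-up hyperplane and point; all of the numbers are assumptions chosen only for the illustration.

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane {x : beta0 + beta @ x = 0},
    following (4.40): (beta^T x + beta0) / ||beta||."""
    return (beta @ x + beta0) / np.linalg.norm(beta)

# Made-up hyperplane and point, purely for illustration.
beta = np.array([3.0, 4.0])       # ||beta|| = 5
beta0 = -5.0
print(signed_distance(np.array([1.0, 2.0]), beta, beta0))  # (3 + 8 - 5) / 5 = 1.2
```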
4.5.1 Rosenblatt’s Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. If a response $y_i = 1$ is misclassified, then $x_i^T\beta + \beta_0 < 0$, and the opposite for a misclassified response with $y_i = -1$. The goal is to minimize
\[
D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i (x_i^T \beta + \beta_0), \tag{4.41}
\]
where $\mathcal{M}$ indexes the set of misclassified points. The quantity is non-negative and proportional to the distance of the misclassified points to the decision boundary defined by $\beta^T x + \beta_0 = 0$.
The gradient (assuming $\mathcal{M}$ is fixed) is given by
\[
\frac{\partial D(\beta, \beta_0)}{\partial \beta} = -\sum_{i \in \mathcal{M}} y_i x_i, \tag{4.42}
\]
\[
\frac{\partial D(\beta, \beta_0)}{\partial \beta_0} = -\sum_{i \in \mathcal{M}} y_i. \tag{4.43}
\]
The algorithm in fact uses stochastic gradient descent to minimize this piecewise linear criterion. This means that rather than computing the sum of the gradient contributions of each observation followed by a step in the negative gradient direction, a step is taken after each observation is visited. Hence the misclassified observations are visited in some sequence, and the parameters $\beta$ are updated via
\[
\begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} \leftarrow \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} + \rho \begin{pmatrix} y_i x_i \\ y_i \end{pmatrix}. \tag{4.44}
\]
Here $\rho$ is the learning rate, which in this case can be taken to be 1 without loss in generality.
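In code, the update (4.44) amounts to a few lines. The sketch below is one way to write Rosenblatt's algorithm with $\rho = 1$; the toy data, the stopping rule, and the function name are assumptions made for the illustration.

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    """Perceptron learning: cycle through the data and, for each
    misclassified point, apply the update (4.44)."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # misclassified
                beta += rho * y[i] * X[i]
                beta0 += rho * y[i]
                mistakes += 1
        if mistakes == 0:                           # a separating hyperplane was found
            break
    return beta, beta0

# Hypothetical separable data; with separable classes the loop terminates
# after a finite number of updates (Exercise 4.6), but the solution found
# depends on the ordering of the points and the starting values.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-2, -2], 0.7, size=(10, 2)),
               rng.normal([+2, +2], 0.7, size=(10, 2))])
y = np.r_[-np.ones(10), np.ones(10)]
print(perceptron(X, y))
```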
If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (Exercise 4.6). Figure 4.14 shows two solutions to a toy problem, each started at a different random guess.

There are a number of problems with this algorithm, summarized in Ripley (1996):

• When the data are separable, there are many solutions, and which one is found depends on the starting values.

• The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.
• When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.

The second problem can often be eliminated by seeking a hyperplane not in the original space, but in a much enlarged space obtained by creating many basis-function transformations of the original variables. This is analogous to driving the residuals in a polynomial regression problem down to zero by making the degree sufficiently large. Perfect separation cannot always be achieved: for example, if observations from two different classes share the same input. It may not be desirable either, since the resulting model is likely to be overfit and will not generalize well. We return to this point at the end of the next section.
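As an illustration of this enlargement (the circular toy data and the use of a degree-2 polynomial basis are assumptions made for the sketch), two classes that no hyperplane in the original coordinates can separate become separable after a quadratic basis expansion.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical "one class inside, one class outside a circle" data:
# not linearly separable in the original two coordinates.
rng = np.random.default_rng(3)
r = np.r_[rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)]
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.r_[-np.ones(50), np.ones(50)]

# Degree-2 basis expansion: (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2).
# In this enlarged space the classes are separated by the hyperplane that
# corresponds to x1^2 + x2^2 = const in the original coordinates.
Z = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X.shape, Z.shape)   # (100, 2) (100, 5)
```

A perceptron run on Z rather than X will then terminate, at the cost of a more flexible, and potentially overfit, boundary in the original coordinates.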
A rather elegant solution to the first problem is to add additional constraints to the separating hyperplane.

4.5.2 Optimal Separating Hyperplanes

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996). Not only does this provide a unique solution to the separating hyperplane problem, but by maximizing the margin between the two classes on the training data, this leads to better classification performance on test data.

We need to generalize criterion (4.41).
Consider the optimization problem
\[
\max_{\beta, \beta_0, \|\beta\|=1} M \quad \text{subject to } y_i(x_i^T \beta + \beta_0) \ge M, \; i = 1, \ldots, N. \tag{4.45}
\]
The set of conditions ensure that all the points are at least a signed distance $M$ from the decision boundary defined by $\beta$ and $\beta_0$, and we seek the largest such $M$ and associated parameters. We can get rid of the $\|\beta\| = 1$ constraint by replacing the conditions with
\[
\frac{1}{\|\beta\|}\, y_i(x_i^T \beta + \beta_0) \ge M, \tag{4.46}
\]
(which redefines $\beta_0$) or equivalently
\[
y_i(x_i^T \beta + \beta_0) \ge M\|\beta\|. \tag{4.47}
\]
Since for any $\beta$ and $\beta_0$ satisfying these inequalities, any positively scaled multiple satisfies them too, we can arbitrarily set $\|\beta\| = 1/M$. Thus (4.45) is equivalent to
\[
\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(x_i^T \beta + \beta_0) \ge 1, \; i = 1, \ldots, N. \tag{4.48}
\]
In light of (4.40), the constraints define an empty slab or margin around the linear decision boundary of thickness $1/\|\beta\|$. Hence we choose $\beta$ and $\beta_0$ to maximize its thickness. This is a convex optimization problem (quadratic criterion with linear inequality constraints). The Lagrange (primal) function, to be minimized w.r.t. $\beta$ and $\beta_0$, is
\[
L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N \alpha_i\,[y_i(x_i^T \beta + \beta_0) - 1]. \tag{4.49}
\]
Setting the derivatives to zero, we obtain:
\[
\beta = \sum_{i=1}^N \alpha_i y_i x_i, \tag{4.50}
\]
\[
0 = \sum_{i=1}^N \alpha_i y_i, \tag{4.51}
\]
and substituting these in (4.49) we obtain the so-called Wolfe dual
\[
L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{subject to } \alpha_i \ge 0 \text{ and } \sum_{i=1}^N \alpha_i y_i = 0. \tag{4.52}
\]
The solution is obtained by maximizing $L_D$ in the positive orthant, a simpler convex optimization problem, for which standard software can be used. In addition the solution must satisfy the Karush–Kuhn–Tucker conditions, which include (4.50), (4.51), (4.52) and
\[
\alpha_i\,[y_i(x_i^T \beta + \beta_0) - 1] = 0 \;\; \forall i. \tag{4.53}
\]
From these we can see that

• if $\alpha_i > 0$, then $y_i(x_i^T \beta + \beta_0) = 1$, or in other words, $x_i$ is on the boundary of the slab;

• if $y_i(x_i^T \beta + \beta_0) > 1$, $x_i$ is not on the boundary of the slab, and $\alpha_i = 0$.

From (4.50) we see that the solution vector $\beta$ is defined in terms of a linear combination of the support points $x_i$: those points defined to be on the boundary of the slab via $\alpha_i > 0$.
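The following sketch puts these pieces together for separable data: it maximizes the Wolfe dual (4.52) with a general-purpose optimizer standing in for the standard software mentioned above, recovers $\beta$ from (4.50) and $\beta_0$ from (4.53), and classifies new points by the sign of $x^T\beta + \beta_0$. The toy data, the support-point tolerance, and the function name are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_separating_hyperplane(X, y):
    """Hard-margin solution via the Wolfe dual (4.52), as a sketch.
    Assumes the classes are linearly separable and y takes values in {-1, +1}."""
    n = len(y)
    G = X * y[:, None]
    K = G @ G.T                                     # K[i, k] = y_i y_k x_i^T x_k

    # Maximize sum(alpha) - 0.5 * alpha^T K alpha (here: minimize its negative)
    # subject to alpha >= 0 and sum(alpha_i y_i) = 0.
    fun = lambda a: 0.5 * a @ K @ a - a.sum()
    jac = lambda a: K @ a - np.ones(n)
    cons = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
    res = minimize(fun, np.zeros(n), jac=jac, bounds=[(0, None)] * n,
                   constraints=[cons], method="SLSQP")
    alpha = res.x

    beta = (alpha * y) @ X                          # (4.50)
    support = alpha > 1e-6                          # points on the boundary of the slab
    beta0 = np.mean(y[support] - X[support] @ beta)  # from the KKT condition (4.53)
    return beta, beta0, support

# Hypothetical separable data; new points are classified by sign(x^T beta + beta0).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-2, -2], 0.6, size=(10, 2)),
               rng.normal([+2, +2], 0.6, size=(10, 2))])
y = np.r_[-np.ones(10), np.ones(10)]
beta, beta0, support = optimal_separating_hyperplane(X, y)
print(beta, beta0, support.sum(), (np.sign(X @ beta + beta0) == y).all())
```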
Figure 4.16 shows the optimal separating hyperplane for our toy example; there are three support points. Likewise, $\beta_0$ is obtained by solving (4.53) for any of the support points.

FIGURE 4.16. The same data as in Figure 4.14. The shaded region delineates the maximum margin separating the two classes. There are three support points indicated, which lie on the boundary of the margin, and the optimal separating hyperplane (blue line) bisects the slab. Included in the figure is the boundary found using logistic regression (red line), which is very close to the optimal separating hyperplane (see Section 12.3.3).

The optimal separating hyperplane produces a function $\hat f(x) = x^T\hat\beta + \hat\beta_0$ for classifying new observations:
\[
\hat G(x) = \operatorname{sign} \hat f(x). \tag{4.54}
\]
Although none of the training observations fall in the margin (by construction), this will not necessarily be the case for test observations. The intuition is that a large margin on the training data will lead to good separation on the test data.
The description of the solution in terms of support points seems to suggest that the optimal hyperplane focuses more on the points that count, and is more robust to model misspecification. The LDA solution, on the other hand, depends on all of the data, even points far away from the decision boundary.
Note, however, that the identification of these support points required the use of all the data. Of course, if the classes are really Gaussian, then LDA is optimal, and separating hyperplanes will pay a price for focusing on the (noisier) data at the boundaries of the classes.

Included in Figure 4.16 is the logistic regression solution to this problem, fit by maximum likelihood. Both solutions are similar in this case. When a separating hyperplane exists, logistic regression will always find it, since the log-likelihood can be driven to 0 in this case (Exercise 4.5). The logistic regression solution shares some other qualitative features with the separating hyperplane solution. The coefficient vector is defined by a weighted least squares fit of a zero-mean linearized response on the input features, and the weights are larger for points near the decision boundary than for those further away.

When the data are not separable, there will be no feasible solution to this problem, and an alternative formulation is needed.
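As a rough check of the similarity seen in Figure 4.16 (the toy data and the scikit-learn usage are assumptions made for this sketch), one can compare the direction of a very weakly penalized logistic regression fit with that of a linear support vector classifier with a large cost parameter, which approximates the optimal separating hyperplane on separable data; the normalized coefficient vectors typically come out very close.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical separable data; compare the boundary directions.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([-2, -2], 0.6, size=(10, 2)),
               rng.normal([+2, +2], 0.6, size=(10, 2))])
y = np.r_[-np.ones(10), np.ones(10)]

logit = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)  # very weak penalty
svm = SVC(kernel="linear", C=1e6).fit(X, y)                   # near hard margin

unit = lambda v: v / np.linalg.norm(v)
print(unit(logit.coef_.ravel()), unit(svm.coef_.ravel()))     # nearly the same direction
```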