Hebb (1949) heavily influenced the development of learning algorithms. The resurgence of neural networks in the mid 1980s was due to Werbos (1974), Parker (1985) and Rumelhart et al. (1986), who proposed the back-propagation algorithm. Today there are many books written on the topic, for a broad range of audiences. For readers of this book, Hertz et al. (1991), Bishop (1995) and Ripley (1996) may be the most informative. Bayesian learning for neural networks is described in Neal (1996). The ZIP code example was taken from Le Cun (1989); see also Le Cun et al.
(1990) and Le Cun et al. (1998). We do not discuss theoretical topics such as the approximation properties of neural networks, for example the work of Barron (1993), Girosi et al. (1995) and Jones (1992). Some of these results are summarized by Ripley (1996).

Exercises

Ex. 11.1 Establish the exact correspondence between the projection pursuit regression model (11.1) and the neural network (11.5).
In particular, show that the single-layer regression network is equivalent to a PPR model with $g_m(\omega_m^T x) = \beta_m \sigma(\alpha_{0m} + s_m(\omega_m^T x))$, where $\omega_m$ is the $m$th unit vector. Establish a similar equivalence for a classification network.

Ex. 11.2 Consider a neural network for a quantitative outcome as in (11.5), using squared-error loss and identity output function $g_k(t) = t$. Suppose that the weights $\alpha_m$ from the input to hidden layer are nearly zero.
Show that the resulting model is nearly linear in the inputs.

Ex. 11.3 Derive the forward and backward propagation equations for the cross-entropy loss function.

Ex. 11.4 Consider a neural network for a K-class outcome that uses cross-entropy loss. If the network has no hidden layer, show that the model is equivalent to the multinomial logistic model described in Chapter 4.

Ex. 11.5

(a) Write a program to fit a single hidden layer neural network (ten hidden units) via back-propagation and weight decay.

(b) Apply it to 100 observations from the model
$$Y = \sigma(a_1^T X) + (a_2^T X)^2 + 0.30 \cdot Z,$$
where $\sigma$ is the sigmoid function, $Z$ is standard normal, $X^T = (X_1, X_2)$, each $X_j$ being independent standard normal, and $a_1 = (3, 3)$, $a_2 = (3, -3)$. Generate a test sample of size 1000, and plot the training and test error curves as a function of the number of training epochs, for different values of the weight decay parameter.
Discuss the overfitting behavior in each case. (A sketch of one way to simulate these data and fit such a network is given after Ex. 11.6 below.)

(c) Vary the number of hidden units in the network, from 1 up to 10, and determine the minimum number needed to perform well for this task.

Ex. 11.6 Write a program to carry out projection pursuit regression, using cubic smoothing splines with fixed degrees of freedom. Fit it to the data from the previous exercise, for various values of the smoothing parameter and number of model terms. Find the minimum number of model terms necessary for the model to perform well and compare this to the number of hidden units from the previous exercise.
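As a starting point for Exercises 11.5 and 11.6, the following is a minimal sketch (not part of the text) of simulating the data in Ex. 11.5(b) and fitting a ten-hidden-unit network. It assumes NumPy and scikit-learn are available and uses MLPRegressor's alpha penalty in place of a hand-written weight-decay term; the exercise itself asks for a back-propagation implementation, and tracing error against training epochs would require replacing the single fit call with an explicit training loop (for example via partial_fit).

import numpy as np
from sklearn.neural_network import MLPRegressor

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def simulate(n, rng):
    # Model of Ex. 11.5(b): Y = sigma(a1'X) + (a2'X)^2 + 0.30 Z
    a1, a2 = np.array([3.0, 3.0]), np.array([3.0, -3.0])
    X = rng.standard_normal((n, 2))        # X1, X2 independent standard normal
    Z = rng.standard_normal(n)             # noise term
    return X, sigmoid(X @ a1) + (X @ a2) ** 2 + 0.30 * Z

rng = np.random.default_rng(0)
X_train, y_train = simulate(100, rng)      # training sample of size 100
X_test, y_test = simulate(1000, rng)       # test sample of size 1000

for alpha in [0.0, 0.001, 0.01, 0.1]:      # candidate weight-decay values
    net = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                       alpha=alpha, solver="sgd", learning_rate_init=0.01,
                       max_iter=2000, random_state=0)
    net.fit(X_train, y_train)
    train_mse = np.mean((net.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((net.predict(X_test) - y_test) ** 2)
    print(f"weight decay {alpha}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")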
Ex. 11.7 Fit a neural network to the spam data of Section 9.1.2, and compare the results to those for the additive model given in that chapter. Compare both the classification performance and interpretability of the final model.

12 Support Vector Machines and Flexible Discriminants

12.1 Introduction

In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable.
Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.
The second set of methods generalize Fisher's linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis, which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines; penalized discriminant analysis, for problems such as signal and image classification where the large number of features are highly correlated; and mixture discriminant analysis, for irregularly shaped classes.

12.2 The Support Vector Classifier

In Chapter 4 we discussed a technique for constructing an optimal separating hyperplane between two perfectly separated classes. We review this and generalize to the nonseparable case, where the classes may not be separable by a linear boundary.
FIGURE 12.1. Support vector classifiers. The left panel shows the separable case. The decision boundary is the solid line, while broken lines bound the shaded maximal margin of width $2M = 2/\|\beta\|$. The right panel shows the nonseparable (overlap) case. The points labeled $\xi_j^*$ are on the wrong side of their margin by an amount $\xi_j^* = M\xi_j$; points on the correct side have $\xi_j^* = 0$. The margin is maximized subject to a total budget $\sum \xi_i \le$ constant. Hence $\sum \xi_j^*$ is the total distance of points on the wrong side of their margin.

Our training data consists of N pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. Define a hyperplane by
$$\{x : f(x) = x^T\beta + \beta_0 = 0\}, \tag{12.1}$$
where $\beta$ is a unit vector: $\|\beta\| = 1$. A classification rule induced by $f(x)$ is
$$G(x) = \mathrm{sign}[x^T\beta + \beta_0]. \tag{12.2}$$
The geometry of hyperplanes is reviewed in Section 4.5, where we show that $f(x)$ in (12.1) gives the signed distance from a point $x$ to the hyperplane $f(x) = x^T\beta + \beta_0 = 0$. Since the classes are separable, we can find a function $f(x) = x^T\beta + \beta_0$ with $y_i f(x_i) > 0 \; \forall i$. Hence we are able to find the hyperplane that creates the biggest margin between the training points for class 1 and $-1$ (see Figure 12.1).
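As a small numerical illustration of (12.1) and (12.2) (not from the text; the values of $\beta$ and $\beta_0$ below are made up), the signed distance and the induced classification rule are simply:

import numpy as np

beta = np.array([3.0, 4.0]) / 5.0     # a unit vector, so ||beta|| = 1 as in (12.1)
beta_0 = -1.0

def f(x):
    # Signed distance from x to the hyperplane {x : x'beta + beta_0 = 0}
    return x @ beta + beta_0

def G(x):
    # Classification rule (12.2)
    return np.sign(f(x))

x = np.array([2.0, 1.0])
print(f(x), G(x))                     # 1.0 and 1.0: x lies one unit on the positive side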
The optimization problem
$$\max_{\beta, \beta_0, \|\beta\|=1} M \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge M, \; i = 1, \ldots, N, \tag{12.3}$$
captures this concept. The band in the figure is $M$ units away from the hyperplane on either side, and hence $2M$ units wide. It is called the margin. We showed that this problem can be more conveniently rephrased as
$$\min_{\beta, \beta_0} \|\beta\| \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1, \; i = 1, \ldots, N, \tag{12.4}$$
where we have dropped the norm constraint on $\beta$.
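The equivalence of (12.3) and (12.4) is established in Section 4.5.2; as a reminder, here is a brief sketch of that step (our paraphrase, not the book's wording). For general (not necessarily unit-norm) $\beta$, the margin conditions in (12.3) can be written in terms of signed distances as
$$\frac{1}{\|\beta\|}\, y_i(x_i^T\beta + \beta_0) \ge M, \qquad i = 1, \ldots, N.$$
These constraints are unaffected if $(\beta, \beta_0)$ is rescaled by any positive factor, so we may fix the scale by setting $\|\beta\| = 1/M$. The constraints then become $y_i(x_i^T\beta + \beta_0) \ge 1$, and maximizing $M = 1/\|\beta\|$ is the same as minimizing $\|\beta\|$, which gives (12.4).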
Note that $M = 1/\|\beta\|$. Expression (12.4) is the usual way of writing the support vector criterion for separated data. This is a convex optimization problem (quadratic criterion, linear inequality constraints), and the solution is characterized in Section 4.5.2.

Suppose now that the classes overlap in feature space. One way to deal with the overlap is to still maximize $M$, but allow for some points to be on the wrong side of the margin. Define the slack variables $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$.
There are two natural ways to modify the constraint in (12.3):
$$y_i(x_i^T\beta + \beta_0) \ge M - \xi_i, \tag{12.5}$$
or
$$y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i), \tag{12.6}$$
$\forall i$, with $\xi_i \ge 0$ and $\sum_{i=1}^N \xi_i \le$ constant. The two choices lead to different solutions. The first choice seems more natural, since it measures overlap in actual distance from the margin; the second choice measures the overlap in relative distance, which changes with the width of the margin $M$. However, the first choice results in a nonconvex optimization problem, while the second is convex; thus (12.6) leads to the “standard” support vector classifier, which we use from here on.

Here is the idea of the formulation.
The value $\xi_i$ in the constraint $y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i)$ is the proportional amount by which the prediction $f(x_i) = x_i^T\beta + \beta_0$ is on the wrong side of its margin. Hence by bounding the sum $\sum \xi_i$, we bound the total proportional amount by which predictions fall on the wrong side of their margin. Misclassifications occur when $\xi_i > 1$, so bounding $\sum \xi_i$ at a value $K$ say, bounds the total number of training misclassifications at $K$.

As in (4.48) in Section 4.5.2, we can drop the norm constraint on $\beta$, define $M = 1/\|\beta\|$, and write (12.4) in the equivalent form
$$\min \|\beta\| \quad \text{subject to } \begin{cases} y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \; \forall i, \\ \xi_i \ge 0, \; \sum \xi_i \le \text{constant}. \end{cases} \tag{12.7}$$
This is the usual way the support vector classifier is defined for the nonseparable case. However we find confusing the presence of the fixed scale “1” in the constraint $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$, and prefer to start with (12.6). The right panel of Figure 12.1 illustrates this overlapping case.

By the nature of the criterion (12.7), we see that points well inside their class boundary do not play a big role in shaping the boundary.
This seemslike an attractive property, and one that differentiates it from linear discriminant analysis (Section 4.3). In LDA, the decision boundary is determined by the covariance of the class distributions and the positions of theclass centroids. We will see in Section 12.3.3 that logistic regression is moresimilar to the support vector classifier in this regard.42012. Flexible Discriminants12.2.1 Computing the Support Vector ClassifierThe problem (12.7) is quadratic with linear inequality constraints, hence itis a convex optimization problem. We describe a quadratic programmingsolution using Lagrange multipliers.
Computationally it is convenient to re-express (12.7) in the equivalent form
$$\min_{\beta, \beta_0} \; \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to } \xi_i \ge 0, \; y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \; \forall i, \tag{12.8}$$
where the “cost” parameter $C$ replaces the constant in (12.7); the separable case corresponds to $C = \infty$.

The Lagrange (primal) function is
$$L_P = \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] - \sum_{i=1}^N \mu_i \xi_i, \tag{12.9}$$
which we minimize w.r.t. $\beta$, $\beta_0$ and $\xi_i$. Setting the respective derivatives to zero, we get
$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \tag{12.10}$$
$$0 = \sum_{i=1}^N \alpha_i y_i, \tag{12.11}$$
$$\alpha_i = C - \mu_i, \; \forall i, \tag{12.12}$$
as well as the positivity constraints $\alpha_i, \mu_i, \xi_i \ge 0 \; \forall i$.
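As a rough numerical companion to (12.8)–(12.12) (not part of the text), the sketch below assumes scikit-learn's SVC, whose cost parameter C plays the role of C in (12.8) when a linear kernel is used, and checks the stationarity condition (12.10): the fitted coefficient vector equals $\sum_i \alpha_i y_i x_i$, with $\alpha_i > 0$ only for the support vectors.

import numpy as np
from sklearn.svm import SVC

# Two overlapping classes in IR^2 (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(100) > 0, 1, -1)

svc = SVC(kernel="linear", C=1.0).fit(X, y)

beta = svc.coef_.ravel()                    # fitted beta
beta_0 = svc.intercept_[0]                  # fitted beta_0
M = 1.0 / np.linalg.norm(beta)              # margin half-width, cf. M = 1/||beta||

# dual_coef_ stores alpha_i * y_i for the support vectors, so (12.10) reads:
beta_from_dual = svc.dual_coef_.ravel() @ svc.support_vectors_
print(M, np.allclose(beta, beta_from_dual))  # margin, and True up to numerical tolerance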