The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Again one can enlarge the space using basis transformations, but this can lead to artificial separation through over-fitting. In Chapter 12 we discuss a more attractive alternative known as the support vector machine, which allows for overlap, but minimizes a measure of the extent of this overlap.

Bibliographic Notes

Good general texts on classification include Duda et al. (2000), Hand (1981), McLachlan (1992) and Ripley (1996). Mardia et al. (1979) have a concise discussion of linear discriminant analysis. Michie et al. (1994) compare a large number of popular classifiers on benchmark datasets. Linear separating hyperplanes are discussed in Vapnik (1996). Our account of the perceptron learning algorithm follows Ripley (1996).

Exercises

Ex. 4.1 Show how to solve the generalized eigenvalue problem $\max_a a^T B a$ subject to $a^T W a = 1$ by transforming to a standard eigenvalue problem.
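For concreteness, here is a minimal numerical sketch of that reduction, assuming $B$ is symmetric and $W$ is symmetric positive definite; the function name and the Cholesky route are illustrative choices, not the only way to do it.

```python
import numpy as np

def generalized_max(B, W):
    """Maximize a^T B a subject to a^T W a = 1 (sketch).

    With a Cholesky factor W = L L^T and the substitution b = L^T a, the problem
    becomes maximizing b^T (L^{-1} B L^{-T}) b subject to ||b|| = 1, a standard
    symmetric eigenproblem whose solution is the top eigenvector.
    """
    L = np.linalg.cholesky(W)                           # W = L L^T, L lower triangular
    M = np.linalg.solve(L, np.linalg.solve(L, B).T).T   # M = L^{-1} B L^{-T} (symmetric)
    vals, vecs = np.linalg.eigh(M)                      # eigenvalues in ascending order
    b = vecs[:, -1]                                     # eigenvector of the largest eigenvalue
    a = np.linalg.solve(L.T, b)                         # map back: a = L^{-T} b
    return vals[-1], a

# scipy.linalg.eigh(B, W) solves the generalized problem directly and should
# return the same largest eigenvalue and eigenvector (up to sign).
```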
Ex. 4.2 Suppose we have features $x \in \mathbb{R}^p$, a two-class response, with class sizes $N_1, N_2$, and the target coded as $-N/N_1$, $N/N_2$.

(a) Show that the LDA rule classifies to class 2 if
\[
x^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) > \tfrac{1}{2}(\hat\mu_2 + \hat\mu_1)^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) - \log(N_2/N_1),
\]
and class 1 otherwise.

(b) Consider minimization of the least squares criterion
\[
\sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2. \tag{4.55}
\]
Show that the solution $\hat\beta$ satisfies
\[
\bigl[(N-2)\hat\Sigma + N\hat\Sigma_B\bigr]\beta = N(\hat\mu_2 - \hat\mu_1) \tag{4.56}
\]
(after simplification), where $\hat\Sigma_B = \frac{N_1 N_2}{N^2}(\hat\mu_2 - \hat\mu_1)(\hat\mu_2 - \hat\mu_1)^T$.

(c) Hence show that $\hat\Sigma_B \beta$ is in the direction $(\hat\mu_2 - \hat\mu_1)$ and thus
\[
\hat\beta \propto \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1). \tag{4.57}
\]
Therefore the least-squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

(d) Show that this result holds for any (distinct) coding of the two classes.

(e) Find the solution $\hat\beta_0$ (up to the same scalar multiple as in (c)), and hence the predicted value $\hat f(x) = \hat\beta_0 + x^T \hat\beta$. Consider the following rule: classify to class 2 if $\hat f(x) > 0$ and class 1 otherwise. Show this is not the same as the LDA rule unless the classes have equal numbers of observations.

(Fisher, 1936; Ripley, 1996)
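The proportionality in (c) is easy to check numerically; here is a small simulation sketch in which the data-generating choices are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes with unequal sizes (all numbers are illustrative).
N1, N2, p = 40, 60, 3
N = N1 + N2
X = np.vstack([rng.normal(0.0, 1.0, size=(N1, p)),
               rng.normal(1.0, 1.0, size=(N2, p))])
y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])   # the -N/N1, N/N2 coding

# Least-squares fit of y on (1, x); drop the intercept to get beta-hat.
beta = np.linalg.lstsq(np.hstack([np.ones((N, 1)), X]), y, rcond=None)[0][1:]

# LDA direction: pooled within-class covariance (divisor N - 2) applied to the mean difference.
mu1, mu2 = X[:N1].mean(axis=0), X[N1:].mean(axis=0)
Sigma = ((X[:N1] - mu1).T @ (X[:N1] - mu1) +
         (X[N1:] - mu2).T @ (X[N1:] - mu2)) / (N - 2)
lda_dir = np.linalg.solve(Sigma, mu2 - mu1)

print(beta / lda_dir)   # the ratio is the same (up to rounding) in every coordinate
```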
Ex. 4.3 Suppose we transform the original predictors $X$ to $\hat Y$ via linear regression. In detail, let $\hat Y = X(X^T X)^{-1} X^T Y = X \hat B$, where $Y$ is the indicator response matrix. Similarly for any input $x \in \mathbb{R}^p$, we get a transformed vector $\hat y = \hat B^T x \in \mathbb{R}^K$. Show that LDA using $\hat Y$ is identical to LDA in the original space.

Ex. 4.4 Consider the multilogit model with $K$ classes (4.17). Let $\beta$ be the $(p+1)(K-1)$-vector consisting of all the coefficients. Define a suitably enlarged version of the input vector $x$ to accommodate this vectorized coefficient matrix. Derive the Newton-Raphson algorithm for maximizing the multinomial log-likelihood, and describe how you would implement this algorithm.
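One possible implementation is sketched below, with class $K$ as the reference and the $K-1$ coefficient vectors stacked row-wise; the fixed iteration count and the small ridge term added before solving are pragmatic assumptions rather than part of the exercise.

```python
import numpy as np

def multinomial_newton(X, g, K, n_iter=20, ridge=1e-8):
    """Newton-Raphson for the K-class multilogit model (sketch), class K-1 as reference.

    X : (N, p) inputs; g : (N,) integer labels in {0, ..., K-1}.
    Returns Theta, shape (K-1, p+1): one row of coefficients (intercept first)
    per non-reference class, i.e. the vectorized coefficient vector of Ex. 4.4.
    """
    N, p = X.shape
    Xs = np.hstack([np.ones((N, 1)), X])            # augmented inputs (1, x)
    Y = np.eye(K)[g][:, :K - 1]                     # indicator responses for classes 0..K-2
    Theta = np.zeros((K - 1, p + 1))
    for _ in range(n_iter):
        E = np.exp(Xs @ Theta.T)                    # (N, K-1) exponentiated linear predictors
        P = E / (1.0 + E.sum(axis=1, keepdims=True))            # fitted probabilities p_ik
        score = ((Y - P).T @ Xs).ravel()                        # gradient of the log-likelihood
        # Negative Hessian: block (k, l) is sum_i p_ik (delta_kl - p_il) x_i x_i^T.
        H = np.zeros(((K - 1) * (p + 1), (K - 1) * (p + 1)))
        for k in range(K - 1):
            for l in range(K - 1):
                w = P[:, k] * ((k == l) - P[:, l])
                H[k * (p + 1):(k + 1) * (p + 1),
                  l * (p + 1):(l + 1) * (p + 1)] = Xs.T @ (w[:, None] * Xs)
        step = np.linalg.solve(H + ridge * np.eye(H.shape[0]), score)
        Theta += step.reshape(K - 1, p + 1)         # beta_new = beta_old + H^{-1} score
    return Theta
```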
Ex. 4.5 Consider a two-class logistic regression problem with $x \in \mathbb{R}$. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample $x_i$ for the two classes are separated by a point $x_0 \in \mathbb{R}$. Generalize this result to (a) $x \in \mathbb{R}^p$ (see Figure 4.16), and (b) more than two classes.

Ex. 4.6 Suppose we have $N$ points $x_i$ in $\mathbb{R}^p$ in general position, with class labels $y_i \in \{-1, 1\}$. Prove that the perceptron learning algorithm converges to a separating hyperplane in a finite number of steps:

(a) Denote a hyperplane by $f(x) = \beta_1^T x + \beta_0 = 0$, or in more compact notation $\beta^T x^* = 0$, where $x^* = (x, 1)$ and $\beta = (\beta_1, \beta_0)$. Let $z_i = x_i^* / \|x_i^*\|$. Show that separability implies the existence of a $\beta_{\mathrm{sep}}$ such that $y_i \beta_{\mathrm{sep}}^T z_i \ge 1$ for all $i$.

(b) Given a current $\beta_{\mathrm{old}}$, the perceptron algorithm identifies a point $z_i$ that is misclassified, and produces the update $\beta_{\mathrm{new}} \leftarrow \beta_{\mathrm{old}} + y_i z_i$. Show that $\|\beta_{\mathrm{new}} - \beta_{\mathrm{sep}}\|^2 \le \|\beta_{\mathrm{old}} - \beta_{\mathrm{sep}}\|^2 - 1$, and hence that the algorithm converges to a separating hyperplane in no more than $\|\beta_{\mathrm{start}} - \beta_{\mathrm{sep}}\|^2$ steps (Ripley, 1996).
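The updates in (b) are simple to write down in code; the sketch below uses the same $z_i = x_i^*/\|x_i^*\|$ normalization, and the cap on the number of updates and the choice of always correcting the first misclassified point are illustrative.

```python
import numpy as np

def perceptron(X, y, max_steps=10_000):
    """Perceptron learning as set up in Ex. 4.6 (sketch).

    X : (N, p) points, y : (N,) labels in {-1, +1}.
    """
    Z = np.hstack([X, np.ones((len(X), 1))])            # x*_i = (x_i, 1)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # z_i = x*_i / ||x*_i||
    beta = np.zeros(Z.shape[1])                         # beta = (beta_1, beta_0)
    for step in range(max_steps):
        miss = np.where(y * (Z @ beta) <= 0)[0]         # misclassified points (or on the boundary)
        if miss.size == 0:
            return beta, step                           # a separating hyperplane has been found
        i = miss[0]
        beta = beta + y[i] * Z[i]                       # the update beta_new <- beta_old + y_i z_i
    return beta, max_steps                              # the data may not be separable
```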
Ex. 4.7 Consider the criterion
\[
D^*(\beta, \beta_0) = -\sum_{i=1}^{N} y_i (x_i^T \beta + \beta_0), \tag{4.58}
\]
a generalization of (4.41) where we sum over all the observations. Consider minimizing $D^*$ subject to $\|\beta\| = 1$. Describe this criterion in words. Does it solve the optimal separating hyperplane problem?
Ex. 4.8 Consider the multivariate Gaussian model $X|G = k \sim N(\mu_k, \Sigma)$, with the additional restriction that $\mathrm{rank}\{\mu_k\}_1^K = L < \max(K-1, p)$. Derive the constrained MLEs for the $\mu_k$ and $\Sigma$. Show that the Bayes classification rule is equivalent to classifying in the reduced subspace computed by LDA (Hastie and Tibshirani, 1996b).

Ex. 4.9 Write a computer program to perform a quadratic discriminant analysis by fitting a separate Gaussian model per class. Try it out on the vowel data, and compute the misclassification error for the test data. The data can be found in the book website www-stat.stanford.edu/ElemStatLearn.
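A minimal sketch of the kind of program Ex. 4.9 asks for is given below; the class name, the plug-in covariance estimates, and the commented-out usage (which assumes the vowel training and test arrays have already been loaded) are all illustrative.

```python
import numpy as np

class QDA:
    """Quadratic discriminant analysis via one Gaussian per class (sketch for Ex. 4.9)."""

    def fit(self, X, g):
        self.classes_ = np.unique(g)
        self.params_ = []
        for k in self.classes_:
            Xk = X[g == k]
            mu = Xk.mean(axis=0)
            Sigma = np.cov(Xk, rowvar=False)                 # per-class covariance estimate
            logdet = np.linalg.slogdet(Sigma)[1]
            self.params_.append((mu, np.linalg.inv(Sigma), logdet, np.log(len(Xk) / len(X))))
        return self

    def predict(self, X):
        # Discriminant: -1/2 log|Sigma_k| - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log(pi_k).
        scores = []
        for mu, Sinv, logdet, logprior in self.params_:
            d = X - mu
            scores.append(-0.5 * logdet - 0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d) + logprior)
        return self.classes_[np.argmax(np.vstack(scores), axis=0)]

# Hypothetical use on the vowel data, once X_train, g_train, X_test, g_test are loaded:
# test_error = np.mean(QDA().fit(X_train, g_train).predict(X_test) != g_test)
```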
5 Basis Expansions and Regularization

5.1 Introduction

We have already made use of models linear in the input features, both for regression and classification. Linear regression, linear discriminant analysis, logistic regression and separating hyperplanes all rely on a linear model. It is extremely unlikely that the true function $f(X)$ is actually linear in $X$.
In regression problems, $f(X) = \mathrm{E}(Y|X)$ will typically be nonlinear and nonadditive in $X$, and representing $f(X)$ by a linear model is usually a convenient, and sometimes a necessary, approximation. Convenient because a linear model is easy to interpret, and is the first-order Taylor approximation to $f(X)$. Sometimes necessary, because with $N$ small and/or $p$ large, a linear model might be all we are able to fit to the data without overfitting.
Likewise in classification, a linear, Bayes-optimal decision boundary implies that some monotone transformation of $\Pr(Y = 1|X)$ is linear in $X$. This is inevitably an approximation.

In this chapter and the next we discuss popular methods for moving beyond linearity. The core idea in this chapter is to augment/replace the vector of inputs $X$ with additional variables, which are transformations of $X$, and then use linear models in this new space of derived input features.
Denote by $h_m(X) : \mathbb{R}^p \mapsto \mathbb{R}$ the $m$th transformation of $X$, $m = 1, \ldots, M$. We then model
\[
f(X) = \sum_{m=1}^{M} \beta_m h_m(X), \tag{5.1}
\]
a linear basis expansion in $X$. The beauty of this approach is that once the basis functions $h_m$ have been determined, the models are linear in these new variables, and the fitting proceeds as before.

Some simple and widely used examples of the $h_m$ are the following:

• $h_m(X) = X_m$, $m = 1, \ldots, p$ recovers the original linear model.

• $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$ allows us to augment the inputs with polynomial terms to achieve higher-order Taylor expansions. Note, however, that the number of variables grows exponentially in the degree of the polynomial. A full quadratic model in $p$ variables requires $O(p^2)$ square and cross-product terms, or more generally $O(p^d)$ for a degree-$d$ polynomial.

• $h_m(X) = \log(X_j)$, $\sqrt{X_j}, \ldots$ permits other nonlinear transformations of single inputs. More generally one can use similar functions involving several inputs, such as $h_m(X) = \|X\|$.
• $h_m(X) = I(L_m \le X_k < U_m)$, an indicator for a region of $X_k$. Breaking the range of $X_k$ up into $M_k$ such nonoverlapping regions results in a model with a piecewise constant contribution for $X_k$.
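To make these examples concrete, here is a sketch that builds a few such basis functions by hand for a hypothetical input matrix with at least three columns; the particular transformations, regions and column choices are arbitrary illustrations.

```python
import numpy as np

def basis_expand(X):
    """Hand-built examples of basis functions h_m(X) (illustrative choices only)."""
    X = np.asarray(X, dtype=float)        # assumes at least 3 columns, with X[:, 1] > 0
    cols = [np.ones(len(X))]                                   # intercept
    cols += [X[:, j] for j in range(X.shape[1])]               # h_m(X) = X_m: the linear model
    cols += [X[:, 0] ** 2, X[:, 0] * X[:, 1]]                  # squares and cross-products
    cols += [np.log(X[:, 1]), np.sqrt(X[:, 1])]                # nonlinear transforms of one input
    cols += [np.linalg.norm(X, axis=1)]                        # a function of several inputs, ||X||
    cols += [((lo <= X[:, 2]) & (X[:, 2] < hi)).astype(float)  # indicators I(L_m <= X_k < U_m)
             for lo, hi in [(0.0, 1.0), (1.0, 2.0)]]
    return np.column_stack(cols)

# Fitting (5.1) is then ordinary least squares on the derived inputs, e.g.
# beta_hat = np.linalg.lstsq(basis_expand(X), y, rcond=None)[0]
```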
Sometimes the problem at hand will call for particular basis functions $h_m$, such as logarithms or power functions. More often, however, we use the basis expansions as a device to achieve more flexible representations for $f(X)$. Polynomials are an example of the latter, although they are limited by their global nature: tweaking the coefficients to achieve a functional form in one region can cause the function to flap about madly in remote regions. In this chapter we consider more useful families of piecewise-polynomials and splines that allow for local polynomial representations. We also discuss the wavelet bases, especially useful for modeling signals and images.

These methods produce a dictionary $\mathcal{D}$ consisting of typically a very large number $|\mathcal{D}|$ of basis functions, far more than we can afford to fit to our data. Along with the dictionary we require a method for controlling the complexity of our model, using basis functions from the dictionary. There are three common approaches:

• Restriction methods, where we decide beforehand to limit the class of functions. Additivity is an example (see the sketch following this list), where we assume that our model has the form
\[
f(X) = \sum_{j=1}^{p} f_j(X_j) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm} h_{jm}(X_j). \tag{5.2}
\]
The size of the model is limited by the number of basis functions $M_j$ used for each component function $f_j$.
• Selection methods, which adaptively scan the dictionary and include only those basis functions $h_m$ that contribute significantly to the fit of the model. Here the variable selection techniques discussed in Chapter 3 are useful. The stagewise greedy approaches such as CART, MARS and boosting fall into this category as well.

• Regularization methods where we use the entire dictionary but restrict the coefficients. Ridge regression is a simple example of a regularization approach, while the lasso is both a regularization and selection method. Here we discuss these and more sophisticated methods for regularization.
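As an illustration of the restriction-by-additivity idea in (5.2), the sketch below gives each coordinate its own small polynomial basis and fits the whole additive model as one linear model; the choice of a polynomial basis and its degree are assumptions made purely for the example.

```python
import numpy as np

def additive_basis(X, degree=3):
    """Additive restriction as in (5.2): h_{jm}(X_j) = X_j^m, m = 1, ..., degree (sketch)."""
    X = np.asarray(X, dtype=float)
    blocks = [np.ones((len(X), 1))]                    # common intercept
    for j in range(X.shape[1]):                        # one block of basis functions per X_j
        blocks.append(np.column_stack([X[:, j] ** m for m in range(1, degree + 1)]))
    return np.hstack(blocks)

# The additive model is then fit as a single linear model in the stacked basis:
# beta_hat = np.linalg.lstsq(additive_basis(X), y, rcond=None)[0]
```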
5.2 Piecewise Polynomials and Splines

We assume until Section 5.7 that $X$ is one-dimensional. A piecewise polynomial function $f(X)$ is obtained by dividing the domain of $X$ into contiguous intervals, and representing $f$ by a separate polynomial in each interval. Figure 5.1 shows two simple piecewise polynomials. The first is piecewise constant, with three basis functions:
\[
h_1(X) = I(X < \xi_1), \quad h_2(X) = I(\xi_1 \le X < \xi_2), \quad h_3(X) = I(\xi_2 \le X).
\]
Since these are positive over disjoint regions, the least squares estimate of the model $f(X) = \sum_{m=1}^{3} \beta_m h_m(X)$ amounts to $\hat\beta_m = \bar Y_m$, the mean of $Y$ in the $m$th region.

The top right panel shows a piecewise linear fit. Three additional basis functions are needed: $h_{m+3} = h_m(X)X$, $m = 1, \ldots, 3$.
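A quick numerical sketch of the claim that least squares on these three indicators reproduces the region means; the knots, sample size and data-generating function are arbitrary, and the last line adds the extra columns used by the piecewise linear fit.

```python
import numpy as np

rng = np.random.default_rng(1)

xi1, xi2 = 0.33, 0.66                                   # illustrative knots
X = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=200)

H = np.column_stack([X < xi1,
                     (xi1 <= X) & (X < xi2),
                     xi2 <= X]).astype(float)           # h_1, h_2, h_3
beta_hat = np.linalg.lstsq(H, y, rcond=None)[0]

region_means = [y[X < xi1].mean(), y[(xi1 <= X) & (X < xi2)].mean(), y[xi2 <= X].mean()]
print(np.allclose(beta_hat, region_means))              # True: beta_m-hat equals Y-bar_m

H_linear = np.column_stack([H, H * X[:, None]])         # adds h_{m+3}(X) = h_m(X) * X
```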