Let RSS be the residual sum-of-squares from the full least squares fit.

(a) Show that
$$\frac{1}{N}\,|\langle x_j,\, y - u(\alpha)\rangle| = (1-\alpha)\lambda, \quad j = 1,\ldots,p,$$
and hence the correlations of each $x_j$ with the residuals remain equal in magnitude as we progress toward $u$.

(b) Show that these correlations are all equal to
$$\lambda(\alpha) = \frac{(1-\alpha)}{\sqrt{(1-\alpha)^2 + \frac{\alpha(2-\alpha)}{N}\cdot \mathrm{RSS}}}\cdot\lambda,$$
and hence they decrease monotonically to zero.

(c) Use these results to show that the LAR algorithm in Section 3.4.4 keeps the correlations tied and monotonically decreasing, as claimed in (3.55).

Ex. 3.24 LAR directions. Using the notation around equation (3.55) on page 74, show that the LAR direction makes an equal angle with each of the predictors in $\mathcal{A}_k$.

Ex. 3.25 LAR look-ahead (Efron et al., 2004, Sec. 2). Starting at the beginning of the kth step of the LAR algorithm, derive expressions to identify the next variable to enter the active set at step $k+1$, and the value of $\alpha$ at which this occurs (using the notation around equation (3.55) on page 74).
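A quick numerical check of (a) and (b) can be reassuring. The identity in (a) in fact holds coordinate-wise for any full-rank standardized design, because the least-squares residual is orthogonal to every column of X; under the exercise's equal-correlation assumption the p values simply coincide with $\lambda(\alpha)$. The Python sketch below is illustrative only (the data and names are made up), and it assumes predictors and response standardized to mean zero and unit variance in the $1/N$ sense.

```python
import numpy as np

# Numerical check of Exercise 3.23 (a)-(b): as u(alpha) = alpha * X beta_hat moves
# toward the least-squares fit, the inner products <x_j, y - u(alpha)>/N all shrink
# by the factor (1 - alpha), and the correlations shrink by lambda(alpha)/lambda.
# Illustrative sketch only; data are simulated.

rng = np.random.default_rng(0)
N, p = 200, 5

X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize: mean 0, (1/N)-variance 1
y = rng.standard_normal(N)
y = (y - y.mean()) / y.std()

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # full least-squares fit
u_full = X @ beta_hat
RSS = np.sum((y - u_full) ** 2)

for alpha in [0.0, 0.25, 0.5, 0.9]:
    r = y - alpha * u_full                       # residual from u(alpha)
    inner = X.T @ r / N                          # (1/N) <x_j, y - u(alpha)>
    # part (a): equals (1 - alpha) * (1/N) <x_j, y> for every j
    assert np.allclose(inner, (1 - alpha) * (X.T @ y / N))
    # part (b): correlations all shrink by the same factor lambda(alpha)/lambda
    corr = inner / np.sqrt(r @ r / N)            # x_j already has unit (1/N)-variance
    shrink = (1 - alpha) / np.sqrt((1 - alpha) ** 2 + alpha * (2 - alpha) * RSS / N)
    assert np.allclose(corr, shrink * (X.T @ y / N))

print("Exercise 3.23 (a)-(b) identities verified numerically.")
```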
Ex. 3.26 Forward stepwise regression enters the variable at each step that most reduces the residual sum-of-squares. LAR adjusts variables that have the most (absolute) correlation with the current residuals. Show that these two entry criteria are not necessarily the same. [Hint: let $x_{j.\mathcal{A}}$ be the jth variable, linearly adjusted for all the variables currently in the model. Show that the first criterion amounts to identifying the j for which $\mathrm{Cor}(x_{j.\mathcal{A}}, r)$ is largest in magnitude.]

Ex. 3.27 Lasso and LAR: Consider the lasso problem in Lagrange multiplier form: with $L(\beta) = \frac{1}{2}\sum_i \bigl(y_i - \sum_j x_{ij}\beta_j\bigr)^2$, we minimize
$$L(\beta) + \lambda \sum_j |\beta_j| \qquad (3.88)$$
for fixed $\lambda > 0$.

(a) Setting $\beta_j = \beta_j^+ - \beta_j^-$ with $\beta_j^+, \beta_j^- \ge 0$, expression (3.88) becomes $L(\beta) + \lambda\sum_j(\beta_j^+ + \beta_j^-)$. Show that the Lagrange dual function is
$$L(\beta) + \lambda\sum_j(\beta_j^+ + \beta_j^-) - \sum_j \lambda_j^+\beta_j^+ - \sum_j \lambda_j^-\beta_j^- \qquad (3.89)$$
and the Karush–Kuhn–Tucker optimality conditions are
$$\begin{aligned}
\nabla L(\beta)_j + \lambda - \lambda_j^+ &= 0\\
-\nabla L(\beta)_j + \lambda - \lambda_j^- &= 0\\
\lambda_j^+\beta_j^+ &= 0\\
\lambda_j^-\beta_j^- &= 0,
\end{aligned}$$
along with the non-negativity constraints on the parameters and all the Lagrange multipliers.

(b) Show that $|\nabla L(\beta)_j| \le \lambda\ \forall j$, and that the KKT conditions imply one of the following three scenarios:
$$\begin{aligned}
\lambda = 0 &\;\Rightarrow\; \nabla L(\beta)_j = 0\ \forall j\\
\beta_j^+ > 0,\ \lambda > 0 &\;\Rightarrow\; \lambda_j^+ = 0,\ \nabla L(\beta)_j = -\lambda < 0,\ \beta_j^- = 0\\
\beta_j^- > 0,\ \lambda > 0 &\;\Rightarrow\; \lambda_j^- = 0,\ \nabla L(\beta)_j = \lambda > 0,\ \beta_j^+ = 0.
\end{aligned}$$
Hence show that for any “active” predictor having $\beta_j \ne 0$, we must have $\nabla L(\beta)_j = -\lambda$ if $\beta_j > 0$, and $\nabla L(\beta)_j = \lambda$ if $\beta_j < 0$.
Assuming the predictors are standardized, relate $\lambda$ to the correlation between the jth predictor and the current residuals.

(c) Suppose that the set of active predictors is unchanged for $\lambda_0 \ge \lambda \ge \lambda_1$. Show that there is a vector $\gamma_0$ such that
$$\hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda - \lambda_0)\gamma_0 \qquad (3.90)$$
Thus the lasso solution path is linear as $\lambda$ ranges from $\lambda_0$ to $\lambda_1$ (Efron et al., 2004; Rosset and Zhu, 2007).
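To see the KKT conditions of part (b) in action, one can fit a lasso numerically and inspect the gradient coordinates $\nabla L(\hat\beta)_j = -x_j^T(y - X\hat\beta)$: every coordinate should satisfy $|\nabla L(\hat\beta)_j| \le \lambda$, with equality in magnitude and the stated signs on the active set. The sketch below is only an illustration with made-up data; it uses scikit-learn's Lasso, whose objective is $(1/(2N))\|y - Xb\|^2 + \alpha\|b\|_1$, so the book's $\lambda$ corresponds to $N\alpha$.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative check of the KKT conditions in Exercise 3.27(b).
# Note: scikit-learn's Lasso penalizes (1/(2N))||y - Xb||^2 + alpha*||b||_1,
# so lambda in the book's notation corresponds to N * alpha.

rng = np.random.default_rng(1)
N, p = 100, 8
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardized predictors
beta_true = np.array([2.0, -1.5, 0, 0, 1.0, 0, 0, 0])
y = X @ beta_true + rng.standard_normal(N)
y = y - y.mean()

alpha = 0.1
lam = N * alpha
fit = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y)
beta_hat = fit.coef_

grad = -X.T @ (y - X @ beta_hat)                    # gradient of L(beta) = 0.5*||y - Xb||^2
active = np.abs(beta_hat) > 1e-8

print("lambda =", lam)
print("max_j |grad_j| =", np.abs(grad).max())       # should be <= lambda (up to solver tolerance)
print("grad on active set:      ", np.round(grad[active], 3))
print("-lambda * sign(beta_hat):", -lam * np.sign(beta_hat[active]))
```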
Ex. 3.28 Suppose for a given $t$ in (3.51), the fitted lasso coefficient for variable $X_j$ is $\hat\beta_j = a$. Suppose we augment our set of variables with an identical copy $X_j^* = X_j$. Characterize the effect of this exact collinearity by describing the set of solutions for $\hat\beta_j$ and $\hat\beta_j^*$, using the same value of $t$.

Ex. 3.29 Suppose we run a ridge regression with parameter $\lambda$ on a single variable $X$, and get coefficient $a$. We now include an exact copy $X^* = X$, and refit our ridge regression.
Show that both coefficients are identical, and derive their value. Show in general that if m copies of a variable $X_j$ are included in a ridge regression, their coefficients are all the same.

Ex. 3.30 Consider the elastic-net optimization problem:
$$\min_\beta\; \|y - X\beta\|^2 + \lambda\bigl[\alpha\|\beta\|_2^2 + (1-\alpha)\|\beta\|_1\bigr]. \qquad (3.91)$$
Show how one can turn this into a lasso problem, using an augmented version of X and y.
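One standard construction for Exercise 3.30 (sketched below as an illustration, not as the book's worked solution) appends $\sqrt{\lambda\alpha}\,I$ to X and p zeros to y; the squared-error term of the augmented problem then absorbs the ridge part of the penalty, leaving a pure lasso with penalty $\lambda(1-\alpha)$. The snippet checks the identity of the two objectives numerically on made-up data.

```python
import numpy as np

# Sketch of one construction for Exercise 3.30 (illustrative only):
# augment X with sqrt(lambda*alpha) * I and y with p zeros. Then
#   ||y_aug - X_aug b||^2 + lambda*(1 - alpha)*||b||_1
# equals the elastic-net objective (3.91), so any lasso solver on the
# augmented data solves the elastic net.

rng = np.random.default_rng(2)
N, p = 50, 6
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
lam, alpha = 2.0, 0.3

X_aug = np.vstack([X, np.sqrt(lam * alpha) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

def elastic_net_objective(b):
    return np.sum((y - X @ b) ** 2) + lam * (alpha * np.sum(b ** 2)
                                             + (1 - alpha) * np.sum(np.abs(b)))

def augmented_lasso_objective(b):
    return np.sum((y_aug - X_aug @ b) ** 2) + lam * (1 - alpha) * np.sum(np.abs(b))

b = rng.standard_normal(p)   # the identity holds for every beta, so check a random one
print(np.isclose(elastic_net_objective(b), augmented_lasso_objective(b)))   # True
```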
4
Linear Methods for Classification

4.1 Introduction

In this chapter we revisit the classification problem and focus on linear methods for classification. Since our predictor $G(x)$ takes values in a discrete set $\mathcal{G}$, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these decision boundaries are linear; this is what we will mean by linear methods for classification.

There are several different ways in which linear decision boundaries can be found. In Chapter 2 we fit linear regression models to the class indicator variables, and classify to the largest fit.
Suppose there are K classes, for convenience labeled $1, 2, \ldots, K$, and the fitted linear model for the kth indicator response variable is $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^T x$. The decision boundary between class k and ℓ is that set of points for which $\hat f_k(x) = \hat f_\ell(x)$, that is, the set $\{x : (\hat\beta_{k0} - \hat\beta_{\ell 0}) + (\hat\beta_k - \hat\beta_\ell)^T x = 0\}$, an affine set or hyperplane.¹ Since the same is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries. This regression approach is a member of a class of methods that model discriminant functions $\delta_k(x)$ for each class, and then classify x to the class with the largest value for its discriminant function. Methods that model the posterior probabilities $\Pr(G = k\,|\,X = x)$ are also in this class. Clearly, if either the $\delta_k(x)$ or $\Pr(G = k\,|\,X = x)$ are linear in x, then the decision boundaries will be linear.

¹ Strictly speaking, a hyperplane passes through the origin, while an affine set need not. We sometimes ignore the distinction and refer in general to hyperplanes.

Actually, all we require is that some monotone transformation of $\delta_k$ or $\Pr(G = k\,|\,X = x)$ be linear for the decision boundaries to be linear. For example, if there are two classes, a popular model for the posterior probabilities is
$$\begin{aligned}
\Pr(G = 1\,|\,X = x) &= \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)},\\
\Pr(G = 2\,|\,X = x) &= \frac{1}{1 + \exp(\beta_0 + \beta^T x)}.
\end{aligned} \qquad (4.1)$$
Here the monotone transformation is the logit transformation: $\log[p/(1-p)]$, and in fact we see that
$$\log\frac{\Pr(G = 1\,|\,X = x)}{\Pr(G = 2\,|\,X = x)} = \beta_0 + \beta^T x. \qquad (4.2)$$
The decision boundary is the set of points for which the log-odds are zero, and this is a hyperplane defined by $\{x\,|\,\beta_0 + \beta^T x = 0\}$.
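A small numerical illustration of (4.1)-(4.2): for any choice of $\beta_0$ and $\beta$, the two posteriors sum to one, their log-odds equal the linear function $\beta_0 + \beta^T x$ exactly, and the class assignment flips precisely where $\beta_0 + \beta^T x$ crosses zero. The coefficients and test point below are made up for illustration.

```python
import numpy as np

# Illustration of (4.1)-(4.2): the two-class logistic model has linear log-odds,
# so its decision boundary {x : beta0 + beta^T x = 0} is a hyperplane.
# beta0 and beta here are arbitrary, made-up values.

beta0 = -1.0
beta = np.array([2.0, -3.0])

def posterior_class1(x):
    """Pr(G = 1 | X = x) under model (4.1)."""
    eta = beta0 + beta @ x
    return np.exp(eta) / (1.0 + np.exp(eta))

x = np.array([0.4, -0.2])
p1 = posterior_class1(x)
p2 = 1.0 - p1                                           # Pr(G = 2 | X = x), as in (4.1)

# The log-odds recover the linear function beta0 + beta^T x, as in (4.2):
print(np.isclose(np.log(p1 / p2), beta0 + beta @ x))    # True

# Classify to class 1 exactly when the log-odds are positive:
print("class 1" if beta0 + beta @ x > 0 else "class 2")
```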
We discuss two very popular but different methods that result in linear log-odds or logits: linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the linear function is fit to the training data.

A more direct approach is to explicitly model the boundaries between the classes as linear. For a two-class problem in a p-dimensional input space, this amounts to modeling the decision boundary as a hyperplane—in other words, a normal vector and a cut-point.
We will look at two methods that explicitly look for “separating hyperplanes.” The first is the well-known perceptron model of Rosenblatt (1958), with an algorithm that finds a separating hyperplane in the training data, if one exists. The second method, due to Vapnik (1996), finds an optimally separating hyperplane if one exists, else finds a hyperplane that minimizes some measure of overlap in the training data. We treat the separable case here, and defer treatment of the nonseparable case to Chapter 12.
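For concreteness, here is a minimal sketch of the classical perceptron update on made-up, linearly separable data (labels coded ±1); it is only meant to fix ideas, since the algorithm, its convergence, and the optimal separating hyperplane are treated properly later in the chapter and in Chapter 12.

```python
import numpy as np

# Minimal sketch of Rosenblatt's perceptron update on made-up, linearly
# separable two-class data with labels coded +1/-1. Illustration only.

rng = np.random.default_rng(3)
w_true, b_true = np.array([1.5, -1.0]), 0.5          # hypothetical separating hyperplane
X = rng.standard_normal((400, 2))
X = X[np.abs(X @ w_true + b_true) > 0.5][:150]       # keep points clear of the boundary
y = np.where(X @ w_true + b_true > 0, 1, -1)         # labels from the hypothetical hyperplane

beta0, beta = 0.0, np.zeros(2)
for _ in range(1000):                                # cycle until no point is misclassified
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (beta0 + xi @ beta) <= 0:            # x_i is on the wrong side (or on the boundary)
            beta0 += yi                              # nudge the hyperplane toward x_i's side
            beta += yi * xi
            mistakes += 1
    if mistakes == 0:
        break

print("training misclassifications:", np.sum(y * (beta0 + X @ beta) <= 0))   # 0 once converged
```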
While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization. For example, we can expand our variable set $X_1, \ldots, X_p$ by including their squares and cross-products $X_1^2, X_2^2, \ldots, X_1X_2, \ldots$, thereby adding $p(p+1)/2$ additional variables. Linear functions in the augmented space map down to quadratic functions in the original space—hence linear decision boundaries to quadratic decision boundaries. Figure 4.1 illustrates the idea. The data are the same: the left plot uses linear decision boundaries in the two-dimensional space shown, while the right plot uses linear decision boundaries in the augmented five-dimensional space described above.
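As an illustration of this augmentation (not the code behind Figure 4.1), one can construct the five coordinates $X_1, X_2, X_1X_2, X_1^2, X_2^2$ explicitly and fit a linear classifier in the augmented space; scikit-learn's LinearDiscriminantAnalysis is used below purely as a convenient stand-in, and the resulting decision regions are bounded by conic sections in the original $(X_1, X_2)$ plane.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative sketch: augment (X1, X2) with X1*X2, X1^2, X2^2, fit a linear
# classifier in the 5-dimensional space, and the induced boundary in the
# original 2-dimensional space is quadratic. Data are made-up Gaussian blobs.

rng = np.random.default_rng(4)
n = 150
X = np.vstack([rng.standard_normal((n, 2)) + c for c in ([0, 0], [3, 3], [3, -3])])
g = np.repeat([1, 2, 3], n)                           # three class labels

def augment(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

lda_linear = LinearDiscriminantAnalysis().fit(X, g)              # linear boundaries in (X1, X2)
lda_quadratic = LinearDiscriminantAnalysis().fit(augment(X), g)  # linear in 5-d, quadratic in 2-d

# Predictions on new points; the second rule's class regions are bounded by
# conic sections in the original coordinates.
grid = np.column_stack([v.ravel() for v in np.meshgrid(np.linspace(-4, 7, 5),
                                                       np.linspace(-7, 7, 5))])
print(lda_linear.predict(grid)[:10])
print(lda_quadratic.predict(augment(grid))[:10])
```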
This approach can be used with any basis transformation h(X) where $h : \mathbb{R}^p \mapsto \mathbb{R}^q$ with q > p, and will be explored in later chapters.

FIGURE 4.1. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space $X_1, X_2, X_1X_2, X_1^2, X_2^2$. Linear inequalities in this space are quadratic inequalities in the original space.

4.2 Linear Regression of an Indicator Matrix

Here each of the response categories are coded via an indicator variable. Thus if G has K classes, there will be K such indicators $Y_k$, $k = 1, \ldots$