The constraint region for ridge regression is the disk $\beta_1^2 + \beta_2^2 \le t$, while that for the lasso is the diamond $|\beta_1| + |\beta_2| \le t$. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter $\beta_j$ equal to zero.
When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion
$$
\tilde\beta = \operatorname*{argmin}_{\beta}\Biggl\{\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|^{q}\Biggr\} \tag{3.53}
$$
for q ≥ 0. The contours of constant value of $\sum_j |\beta_j|^q$ are shown in Figure 3.12, for the case of two inputs.

Thinking of $|\beta_j|^q$ as the log-prior density for $\beta_j$, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density $(1/2\tau)\exp(-|\beta|/\tau)$ and $\tau = 1/\lambda$. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors.
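As a quick numerical illustration of (3.53), the sketch below evaluates the penalized residual sum of squares for a candidate coefficient vector at a few values of q. It is only an assumed example: the function name penalized_rss and the random data are invented here, not taken from the book.

```python
import numpy as np

def penalized_rss(beta0, beta, X, y, lam, q):
    """Criterion (3.53): residual sum of squares plus lambda * sum_j |beta_j|^q."""
    resid = y - beta0 - X @ beta
    return np.sum(resid ** 2) + lam * np.sum(np.abs(beta) ** q)

# Illustrative data and coefficients (assumptions, not the book's examples).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.standard_normal(50)
beta = np.array([1.5, 0.1, -0.8])

for q in (0.5, 1.0, 1.2, 2.0):
    print(f"q = {q}: criterion = {penalized_rss(0.0, beta, X, y, lam=1.0, q=q):.2f}")
```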
Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

FIGURE 3.12. Contours of constant value of $\sum_j |\beta_j|^q$ for given values of q (panels show q = 4, 2, 1, 0.5 and 0.1).

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, $|\beta_j|^q$ is differentiable at 0, and so does not share the ability of the lasso (q = 1) for setting coefficients exactly to zero. Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty
$$
\lambda\sum_{j=1}^{p}\bigl(\alpha\beta_j^{2} + (1-\alpha)|\beta_j|\bigr), \tag{3.54}
$$
a different compromise between ridge and lasso. Figure 3.13 compares the $L_q$ penalty with q = 1.2 and the elastic-net penalty with α = 0.2; it is hard to detect the difference by eye.

FIGURE 3.13. Contours of constant value of $\sum_j |\beta_j|^q$ for q = 1.2 (left plot), and the elastic-net penalty $\sum_j (\alpha\beta_j^2 + (1-\alpha)|\beta_j|)$ for α = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.
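The contrast drawn in Figure 3.13 can also be seen numerically. The following sketch (illustrative only; the function names are assumptions, not the book's code) compares the $L_q$ penalty with q = 1.2 and the elastic-net penalty (3.54) near zero, where the elastic net keeps a lasso-like kink from its $|\beta_j|$ term while $|\beta_j|^{1.2}$ is differentiable at 0.

```python
import numpy as np

def lq_penalty(beta, q):
    """sum_j |beta_j|^q, the L_q penalty appearing in (3.53)."""
    return np.sum(np.abs(beta) ** q)

def elastic_net_penalty(beta, alpha):
    """sum_j (alpha * beta_j**2 + (1 - alpha) * |beta_j|), the penalty in (3.54)."""
    return np.sum(alpha * beta ** 2 + (1 - alpha) * np.abs(beta))

# Near beta_j = 0 the elastic net still grows linearly (slope 1 - alpha),
# while |beta_j|**1.2 flattens out, so it cannot set coefficients exactly to zero.
for b in (1e-3, 1e-2, 1e-1):
    beta = np.array([b, 0.0])
    print(b, lq_penalty(beta, q=1.2), elastic_net_penalty(beta, alpha=0.2))
```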
The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the $L_q$ penalties. We discuss the elastic-net further in Section 18.4.

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of “democratic” version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables.

Least angle regression uses a similar strategy, but only enters “as much” of a predictor as it deserves.
At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable “catches up” in terms of correlation with the residual, the process is paused.
The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued until all the variables are in the model, and ends at the full least-squares fit. Algorithm 3.2 provides the details. The termination condition in step 5 requires some explanation. If p > N − 1, the LAR algorithm reaches a zero-residual solution after N − 1 steps (the −1 is because we have centered the data).

Algorithm 3.2 Least Angle Regression.

1. Standardize the predictors to have mean zero and unit norm. Start with the residual $r = y - \bar{y}$, $\beta_1, \beta_2, \ldots, \beta_p = 0$.

2. Find the predictor $x_j$ most correlated with $r$.

3. Move $\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r\rangle$, until some other competitor $x_k$ has as much correlation with the current residual as does $x_j$.

4. Move $\beta_j$ and $\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as much correlation with the current residual.

5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.

Suppose $\mathcal{A}_k$ is the active set of variables at the beginning of the kth step, and let $\beta_{\mathcal{A}_k}$ be the coefficient vector for these variables at this step; there will be k − 1 nonzero values, and the one just entered will be zero.
If $r_k = y - X_{\mathcal{A}_k}\beta_{\mathcal{A}_k}$ is the current residual, then the direction for this step is
$$
\delta_k = (X_{\mathcal{A}_k}^T X_{\mathcal{A}_k})^{-1} X_{\mathcal{A}_k}^T r_k. \tag{3.55}
$$
The coefficient profile then evolves as $\beta_{\mathcal{A}_k}(\alpha) = \beta_{\mathcal{A}_k} + \alpha\cdot\delta_k$. Exercise 3.23 verifies that the directions chosen in this fashion do what is claimed: keep the correlations tied and decreasing. If the fit vector at the beginning of this step is $\hat{f}_k$, then it evolves as $\hat{f}_k(\alpha) = \hat{f}_k + \alpha\cdot u_k$, where $u_k = X_{\mathcal{A}_k}\delta_k$ is the new fit direction. The name “least angle” arises from a geometrical interpretation of this process; $u_k$ makes the smallest (and equal) angle with each of the predictors in $\mathcal{A}_k$ (Exercise 3.24).
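Putting Algorithm 3.2 and the direction (3.55) together, here is a deliberately naive small-step sketch in Python. It is not the authors' code and not the efficient exact-step implementation discussed below (which exploits piecewise linearity); it assumes the columns of X are standardized to mean zero and unit norm and that y is centered, and the function name and step size are invented for illustration.

```python
import numpy as np

def lar_small_steps(X, y, step=1e-3):
    """Naive small-step sketch of LAR (Algorithm 3.2).

    Assumes the columns of X have mean zero and unit norm and y is centered.
    Returns the coefficient vectors recorded each time a variable enters."""
    N, p = X.shape
    beta = np.zeros(p)
    active = []                      # the active set A_k
    profile = [beta.copy()]

    for _ in range(min(N - 1, p)):
        c = X.T @ (y - X @ beta)     # current correlations with the residual
        inactive = [j for j in range(p) if j not in active]
        # the inactive predictor most correlated with the residual enters
        active.append(inactive[int(np.argmax(np.abs(c[inactive])))])

        while True:
            r = y - X @ beta
            XA = X[:, active]
            # joint least-squares direction (3.55) for the active coefficients
            delta = np.linalg.lstsq(XA, r, rcond=None)[0]
            rest = [j for j in range(p) if j not in active]
            if not rest:
                beta[active] += delta        # all in: go to the full LS fit
                break
            beta[active] += step * delta     # small move along the direction
            c = X.T @ (y - X @ beta)
            # stop when some inactive predictor has "caught up" in correlation
            if np.max(np.abs(c[rest])) >= np.max(np.abs(c[active])):
                break
        profile.append(beta.copy())
    return profile
```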
Figure 3.14 shows the absolute correlations decreasing and joining ranks with each step of the LAR algorithm, using simulated data.

FIGURE 3.14. Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.

By construction the coefficients in LAR change in a piecewise linear fashion. Figure 3.15 (left panel) shows the LAR coefficient profile evolving as a function of its L1 arc length.² Note that we do not need to take small steps and recheck the correlations in step 3; using knowledge of the covariance of the predictors and the piecewise linearity of the algorithm, we can work out the exact step length at the beginning of each step (Exercise 3.25).

²The L1 arc length of a differentiable curve $\beta(s)$ for $s \in [0, S]$ is given by $\mathrm{TV}(\beta, S) = \int_0^S \|\dot\beta(s)\|_1\,ds$, where $\dot\beta(s) = \partial\beta(s)/\partial s$. For the piecewise-linear LAR coefficient profile, this amounts to summing the L1 norms of the changes in coefficients from step to step.

FIGURE 3.15. Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L1 arc length. The right panel shows the lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.

The right panel of Figure 3.15 shows the lasso coefficient profiles on the same data.
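For the piecewise-linear case, the footnote's recipe reduces to a one-line computation. The helper below is an illustrative sketch (the name l1_arc_length is an assumption) that sums the L1 norms of successive coefficient changes over a sequence of coefficient vectors, such as the profile recorded at each LAR step by the lar_small_steps sketch above.

```python
import numpy as np

def l1_arc_length(profile):
    """L1 arc length of a piecewise-linear coefficient profile, given as a
    sequence of coefficient vectors: the sum of ||beta_{k+1} - beta_k||_1."""
    profile = np.asarray(profile)
    return float(np.sum(np.abs(np.diff(profile, axis=0))))
```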
They are almost identical to those in the left panel, and differ for the first time when the blue coefficient passes back through zero. For the prostate data, the LAR coefficient profile turns out to be identical to the lasso profile in Figure 3.10, which never crosses zero. These observations lead to a simple modification of the LAR algorithm that gives the entire lasso path, which is also piecewise-linear.

Algorithm 3.2a Least Angle Regression: Lasso Modification.

4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction.

The LAR(lasso) algorithm is extremely efficient, requiring the same order of computation as that of a single least squares fit using the p predictors. Least angle regression always takes p steps to get to the full least squares estimates.
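In practice the path need not be hand-rolled: scikit-learn's lars_path computes the LAR path, and with method='lasso' it applies the drop-and-recompute rule of step 4a to return the entire lasso path. A usage sketch, assuming a recent scikit-learn and using invented random data for illustration:

```python
import numpy as np
from sklearn.linear_model import lars_path

# Illustrative data (an assumption, not the book's prostate or simulated sets).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5, 0.0]) + rng.standard_normal(100)

# method='lar' traces Algorithm 3.2; method='lasso' adds step 4a,
# dropping any variable whose coefficient crosses zero.
alphas_lar, active_lar, coefs_lar = lars_path(X, y, method='lar')
alphas_lasso, active_lasso, coefs_lasso = lars_path(X, y, method='lasso')

print(coefs_lar.shape, coefs_lasso.shape)   # (n_features, n_points_on_path)
```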