However, for smaller values of t, the DS procedure produces a different path of solutions than the lasso.

Candes and Tao (2007) show that the solution to DS is a linear programming problem; hence the name Dantzig selector, in honor of the late George Dantzig, the inventor of the simplex method for linear programming. They also prove a number of interesting mathematical properties for the method, related to its ability to recover an underlying sparse coefficient vector.
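To make the linear programming connection concrete, the sketch below (not from the book; the function name, variable names, and use of SciPy are my own) poses the bound form of the DS criterion, minimize ||β||_1 subject to ||X^T(y − Xβ)||_∞ ≤ s, as a linear program by splitting β into its positive and negative parts, and solves it with scipy.optimize.linprog. Sweeping s over a grid would trace out the DS coefficient path discussed below.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, s):
    """Sketch: solve  min ||beta||_1  s.t.  ||X^T (y - X beta)||_inf <= s
    as a linear program by writing beta = u - v with u, v >= 0."""
    p = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    c = np.ones(2 * p)                        # objective: sum(u) + sum(v) = ||beta||_1
    # two-sided constraint |X^T y - X^T X beta| <= s, elementwise
    A = np.block([[XtX, -XtX], [-XtX, XtX]])
    b = np.concatenate([s + Xty, s - Xty])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]
```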
These same recovery properties also hold for the lasso, as shown later by Bickel et al. (2008).

Unfortunately the operating properties of the DS method are somewhat unsatisfactory. The method seems similar in spirit to the lasso, especially when we look at the lasso's stationary conditions (3.58). Like the LAR algorithm, the lasso maintains the same inner product (and correlation) with the current residual for all variables in the active set, and moves their coefficients to optimally decrease the residual sum of squares. In the process, this common correlation is decreased monotonically (Exercise 3.23), and at all times this correlation is larger than that for non-active variables.
The Dantzig selector instead tries to minimize the maximum inner product of the current residual with all the predictors. Hence it can achieve a smaller maximum than the lasso, but in the process a curious phenomenon can occur. If the size of the active set is m, there will be m variables tied with maximum correlation. However, these need not coincide with the active set! Hence it can include a variable in the model that has smaller correlation with the current residual than some of the excluded variables (Efron et al., 2007). This seems unreasonable and may be responsible for its sometimes inferior prediction accuracy.
Efron et al. (2007) also show that DS can yield extremely erratic coefficient paths as the regularization parameter s is varied.

3.8.4 The Grouped Lasso

In some problems, the predictors belong to pre-defined groups; for example genes that belong to the same biological pathway, or collections of indicator (dummy) variables for representing the levels of a categorical predictor.
In this situation it may be desirable to shrink and select the members of a group together. The grouped lasso is one way to achieve this. Suppose that the p predictors are divided into L groups, with p_ℓ the number in group ℓ. For ease of notation, we use a matrix X_ℓ to represent the predictors corresponding to the ℓth group, with corresponding coefficient vector β_ℓ. The grouped-lasso minimizes the convex criterion

    \min_{\beta \in \mathbb{R}^p} \Big( \big\| y - \beta_0 \mathbf{1} - \sum_{\ell=1}^{L} X_\ell \beta_\ell \big\|_2^2 + \lambda \sum_{\ell=1}^{L} \sqrt{p_\ell}\, \|\beta_\ell\|_2 \Big),        (3.80)

where the \sqrt{p_\ell} terms account for the varying group sizes, and \|\cdot\|_2 is the Euclidean norm (not squared).
Since the Euclidean norm of a vector β_ℓ is zero only if all of its components are zero, this procedure encourages sparsity at both the group and individual levels. That is, for some values of λ, an entire group of predictors may drop out of the model. This procedure was proposed by Bakin (1999) and Lin and Zhang (2006), and studied and generalized by Yuan and Lin (2007). Generalizations include more general L2 norms ||η||_K = (η^T K η)^{1/2}, as well as allowing overlapping groups of predictors (Zhao et al., 2008).
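The following sketch (my own naming, not the algorithm of Yuan and Lin, 2007) evaluates criterion (3.80) and shows the block soft-thresholding operation that many group-lasso algorithms are built around: an entire group's coefficient vector is set exactly to zero once its norm falls below the threshold λ√p_ℓ, which is the mechanism behind group-level sparsity.

```python
import numpy as np

def group_lasso_objective(y, X_groups, beta0, betas, lam):
    """Evaluate the grouped-lasso criterion (3.80):
    squared-error loss plus lambda * sum_l sqrt(p_l) * ||beta_l||_2."""
    resid = y - beta0 - sum(X @ b for X, b in zip(X_groups, betas))
    penalty = lam * sum(np.sqrt(len(b)) * np.linalg.norm(b) for b in betas)
    return resid @ resid + penalty

def group_soft_threshold(z, lam, p_group):
    """Block soft-thresholding: shrink a group's coefficient vector toward
    zero, and set it exactly to zero when its norm is below the threshold."""
    thresh = lam * np.sqrt(p_group)
    norm = np.linalg.norm(z)
    if norm <= thresh:
        return np.zeros_like(z)
    return (1.0 - thresh / norm) * z
```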
There are also connections to methods for fitting sparse additive models (Lin and Zhang, 2006; Ravikumar et al., 2008).

3.8.5 Further Properties of the Lasso

A number of authors have studied the ability of the lasso and related procedures to recover the correct model, as N and p grow. Examples of this work include Knight and Fu (2000), Greenshtein and Ritov (2004), Tropp (2004), Donoho (2006b), Meinshausen (2007), Meinshausen and Bühlmann (2006), Tropp (2006), Zhao and Yu (2006), Wainwright (2006), and Bunea et al. (2007). For example Donoho (2006b) focuses on the p > N case and considers the lasso solution as the bound t gets large. In the limit this gives the solution with minimum L1 norm among all models with zero training error. He shows that under certain assumptions on the model matrix X, if the true model is sparse, this solution identifies the correct predictors with high probability.

Many of the results in this area assume a condition on the model matrix of the form

    \max_{j \in S^c} \| x_j^T X_S (X_S^T X_S)^{-1} \|_1 \le (1 - \epsilon) \quad \text{for some } \epsilon \in (0, 1].        (3.81)

Here S indexes the subset of features with non-zero coefficients in the true underlying model, and X_S are the columns of X corresponding to those features. Similarly S^c are the features with true coefficients equal to zero, and X_{S^c} the corresponding columns. This says that the least squares coefficients for the columns of X_{S^c} on X_S are not too large, that is, the "good" variables S are not too highly correlated with the nuisance variables S^c.
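For a given design matrix and candidate support, the left-hand side of (3.81) can be computed directly; the helper below is a sketch of my own (the name is hypothetical), regressing each nuisance column on X_S and taking the largest L1 norm of the resulting coefficient vectors. The condition asks that this quantity be bounded strictly below 1.

```python
import numpy as np

def incoherence_lhs(X, S):
    """Left-hand side of (3.81): the largest L1 norm, over nuisance columns
    j in S^c, of the least squares coefficients of x_j regressed on X_S."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    XS = X[:, S]
    coefs = np.linalg.solve(XS.T @ XS, XS.T @ X[:, Sc])  # one column per j in S^c
    return np.abs(coefs).sum(axis=0).max()
```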
Regarding the coefficients themselves, the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and in general they are not consistent.^5 One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an unrestricted linear model to the selected set of features. This is not always feasible, if the selected set is large.
Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since the variables in the second step have less "competition" from noise variables, cross-validation will tend to pick a smaller value for λ, and hence their coefficients will be shrunken less than those in the initial estimate.

^5 Statistical consistency means that as the sample size grows, the estimates converge to the true values.

Alternatively, one can modify the lasso penalty function so that larger coefficients are shrunken less severely; the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2005) replaces λ|β| by J_a(β, λ), where

    \frac{dJ_a(\beta, \lambda)}{d\beta} = \lambda \cdot \mathrm{sign}(\beta) \Big[ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{(a-1)\lambda}\, I(|\beta| > \lambda) \Big]        (3.82)

for some a ≥ 2. The second term in square braces reduces the amount of shrinkage in the lasso for larger values of β, with ultimately no shrinkage as a → ∞.
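For concreteness, a small sketch of the SCAD derivative (3.82), assuming NumPy arrays; the default values λ = 1 and a = 4 mirror Figure 3.20 below, and the function name is my own.

```python
import numpy as np

def scad_derivative(beta, lam=1.0, a=4.0):
    """Derivative of the SCAD penalty, equation (3.82): lasso-like shrinkage
    of size lambda for |beta| <= lambda, tapering linearly to no shrinkage
    once |beta| exceeds a * lambda."""
    b = np.abs(beta)
    taper = np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam)
    return lam * np.sign(beta) * np.where(b <= lam, 1.0, taper)
```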
Figure 3.20 shows the SCAD penalty, along with the lasso and |β|^{1−ν}.

[Figure 3.20: three panels plotting the penalty functions |β|, SCAD, and |β|^{1−ν} against β.]
FIGURE 3.20. The lasso and two alternative non-convex penalties designed to penalize large coefficients less. For SCAD we use λ = 1 and a = 4, and ν = 1/2 in the last panel.

However this criterion is non-convex, which is a drawback since it makes the computation much more difficult. The adaptive lasso (Zou, 2006) uses a weighted penalty of the form \sum_{j=1}^{p} w_j |\beta_j|, where w_j = 1/|\hat\beta_j|^\nu, \hat\beta_j is the ordinary least squares estimate and ν > 0. This is a practical approximation to the |β|^q penalties (q = 1 − ν here) discussed in Section 3.4.3. The adaptive lasso yields consistent estimates of the parameters while retaining the attractive convexity property of the lasso.
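A minimal sketch of the adaptive-lasso weighting step, assuming N > p so that the ordinary least squares fit exists; the helper name is hypothetical. The resulting weighted penalty is then handed to any standard lasso solver, for example by rescaling each column x_j by 1/w_j, fitting an ordinary lasso, and rescaling the coefficients back.

```python
import numpy as np

def adaptive_lasso_weights(X, y, nu=1.0):
    """Penalty weights w_j = 1 / |beta_hat_j|^nu from an ordinary least
    squares fit (assumes N > p so the OLS estimate exists and is non-zero)."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 / np.abs(beta_ols) ** nu
```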
3.8.6 Pathwise Coordinate Optimization

An alternate approach to the LARS algorithm for computing the lasso solution is simple coordinate descent. This idea was proposed by Fu (1998) and Daubechies et al. (2004), and later studied and generalized by Friedman et al. (2007), Wu and Lange (2008) and others. The idea is to fix the penalty parameter λ in the Lagrangian form (3.52) and optimize successively over each parameter, holding the other parameters fixed at their current values.

Suppose the predictors are all standardized to have mean zero and unit norm. Denote by β̃_k(λ) the current estimate for β_k at penalty parameter λ.
We can rearrange (3.52) to isolate β_j,

    R(\tilde\beta(\lambda), \beta_j) = \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \sum_{k \ne j} x_{ik} \tilde\beta_k(\lambda) - x_{ij} \beta_j \Big)^2 + \lambda \sum_{k \ne j} |\tilde\beta_k(\lambda)| + \lambda |\beta_j|,        (3.83)

where we have suppressed the intercept and introduced a factor 1/2 for convenience. This can be viewed as a univariate lasso problem with response variable the partial residual y_i − \tilde{y}_i^{(j)} = y_i − \sum_{k \ne j} x_{ik} \tilde\beta_k(\lambda).
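The excerpt ends just before the closed-form solution of this univariate problem, which is soft-thresholding of the inner product between x_j and the partial residual. The sketch below implements cyclic coordinate descent along these lines, assuming standardized predictors as above; the fixed number of sweeps (rather than a convergence test) and the names are mine, not the book's.

```python
import numpy as np

def soft_threshold(t, lam):
    # S(t, lambda) = sign(t) * (|t| - lambda)_+
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the lasso at a fixed lambda.
    Assumes the columns of X have mean zero and unit norm, as in the text."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual with predictor j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam)
    return beta
```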