Note that $\mathbf{U}^T\mathbf{y}$ are the coordinates of $\mathbf{y}$ with respect to the orthonormal basis $\mathbf{U}$. Note also the similarity with (3.33); $\mathbf{Q}$ and $\mathbf{U}$ are generally different orthogonal bases for the column space of $\mathbf{X}$ (Exercise 3.8). Now the ridge solutions are
$$
\mathbf{X}\hat\beta^{\mathrm{ridge}}
= \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
= \mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}
= \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^T\mathbf{y}, \qquad (3.47)
$$
where the $u_j$ are the columns of $\mathbf{U}$. Note that since $\lambda \ge 0$, we have $d_j^2/(d_j^2 + \lambda) \le 1$. Like linear regression, ridge regression computes the coordinates of $\mathbf{y}$ with respect to the orthonormal basis $\mathbf{U}$. It then shrinks these coordinates by the factors $d_j^2/(d_j^2 + \lambda)$. This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller $d_j^2$.

What does a small value of $d_j^2$ mean? The SVD of the centered matrix $\mathbf{X}$ is another way of expressing the principal components of the variables in $\mathbf{X}$.
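To make (3.47) concrete, here is a minimal numerical sketch (the data, the array names, and the penalty value `lam` are all illustrative, not from the text) that computes the ridge fit both directly and through the SVD shrinkage factors $d_j^2/(d_j^2+\lambda)$; the two expressions should agree up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                      # center the predictors, as in the text
y = rng.normal(size=N)
lam = 10.0                               # hypothetical value of the penalty lambda

# Direct ridge fit: X (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_direct = X @ beta_ridge

# SVD form (3.47): sum_j u_j * d_j^2 / (d_j^2 + lambda) * u_j^T y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)          # each factor is <= 1 since lambda >= 0
fit_svd = U @ (shrinkage * (U.T @ y))

print(np.allclose(fit_direct, fit_svd))  # True: the two expressions coincide
```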
The sample covariance matrix is given by $\mathbf{S} = \mathbf{X}^T\mathbf{X}/N$, and from (3.45) we have
$$
\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T, \qquad (3.48)
$$
which is the eigen decomposition of $\mathbf{X}^T\mathbf{X}$ (and of $\mathbf{S}$, up to a factor $N$). The eigenvectors $v_j$ (columns of $\mathbf{V}$) are also called the principal component (or Karhunen–Loève) directions of $\mathbf{X}$. The first principal component direction $v_1$ has the property that $z_1 = \mathbf{X}v_1$ has the largest sample variance amongst all normalized linear combinations of the columns of $\mathbf{X}$.
This sample variance is easily seen to be
$$
\mathrm{Var}(z_1) = \mathrm{Var}(\mathbf{X}v_1) = \frac{d_1^2}{N}, \qquad (3.49)
$$
and in fact $z_1 = \mathbf{X}v_1 = u_1 d_1$. The derived variable $z_1$ is called the first principal component of $\mathbf{X}$, and hence $u_1$ is the normalized first principal component. Subsequent principal components $z_j$ have maximum variance $d_j^2/N$, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values $d_j$ correspond to directions in the column space of $\mathbf{X}$ having small variance, and ridge regression shrinks these directions the most.

FIGURE 3.9. Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.
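The following minimal sketch (with toy data of my own; the variable names are illustrative) checks numerically that the SVD of a centered $\mathbf{X}$ recovers the principal component directions, that $z_1 = u_1 d_1$, and that the sample variance of $z_1$ is $d_1^2/N$, as in (3.48)–(3.49).

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3
X = rng.normal(size=(N, p)) @ np.diag([3.0, 1.0, 0.3])  # toy data with unequal spread
X -= X.mean(axis=0)                                      # center, as the text assumes

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                                 # columns v_j: principal component directions

z1 = X @ V[:, 0]                                         # first principal component z_1 = X v_1
print(np.allclose(z1, U[:, 0] * d[0]))                   # True: z_1 = u_1 d_1
print(np.isclose(z1 @ z1 / N, d[0]**2 / N))              # sample variance of z_1 equals d_1^2 / N

# Eigen decomposition of X^T X agrees with (3.48): X^T X = V D^2 V^T
print(np.allclose(X.T @ X, V @ np.diag(d**2) @ V.T))     # True
```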
Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the $Y$-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but it need not hold in general.

In Figure 3.7 we have plotted the estimated prediction error versus the quantity
$$
\mathrm{df}(\lambda)
= \mathrm{tr}[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T]
= \mathrm{tr}(\mathbf{H}_\lambda)
= \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}. \qquad (3.50)
$$
This monotone decreasing function of $\lambda$ is the effective degrees of freedom of the ridge regression fit.
Usually in a linear-regression fit with $p$ variables, the degrees-of-freedom of the fit is $p$, the number of free parameters. The idea is that although all $p$ coefficients in a ridge fit will be non-zero, they are fit in a restricted fashion controlled by $\lambda$. Note that $\mathrm{df}(\lambda) = p$ when $\lambda = 0$ (no regularization) and $\mathrm{df}(\lambda) \to 0$ as $\lambda \to \infty$.
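As a quick numerical check of (3.50) and of these two limits, here is a minimal sketch (again on hypothetical data, with an illustrative grid of penalty values) that computes $\mathrm{df}(\lambda)$ both from the trace definition and from the singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 60, 8
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                          # centered predictors

d = np.linalg.svd(X, compute_uv=False)       # singular values d_j

def df_trace(lam):
    """Effective degrees of freedom via tr[X (X^T X + lam I)^{-1} X^T]."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(H)

def df_svd(lam):
    """Same quantity via sum_j d_j^2 / (d_j^2 + lam), as in (3.50)."""
    return np.sum(d**2 / (d**2 + lam))

for lam in [0.0, 1.0, 10.0, 1e6]:
    print(lam, round(df_trace(lam), 4), round(df_svd(lam), 4))
# At lam = 0 both give p = 8; as lam grows, df(lam) decreases towards 0.
```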
Of course there is always an additional one degree of freedom for the intercept, which was removed a priori. This definition is motivated in more detail in Section 3.4.4 and Sections 7.4–7.6. In Figure 3.7 the minimum occurs at $\mathrm{df}(\lambda) = 5.0$. Table 3.3 shows that ridge regression reduces the test error of the full least squares estimates by a small amount.

3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by
$$
\hat\beta^{\mathrm{lasso}} = \operatorname*{argmin}_{\beta}\;\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
\quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le t. \qquad (3.51)
$$
Just as in ridge regression, we can re-parametrize the constant $\beta_0$ by standardizing the predictors; the solution for $\hat\beta_0$ is $\bar y$, and thereafter we fit a model without an intercept (Exercise 3.5). In the signal processing literature, the lasso is also known as basis pursuit (Chen et al., 1998).

We can also write the lasso problem in the equivalent Lagrangian form
$$
\hat\beta^{\mathrm{lasso}} = \operatorname*{argmin}_{\beta}\;\Bigl\{\tfrac{1}{2}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|\Bigr\}. \qquad (3.52)
$$
Notice the similarity to the ridge regression problem (3.42) or (3.41): the $L_2$ ridge penalty $\sum_{1}^{p}\beta_j^2$ is replaced by the $L_1$ lasso penalty $\sum_{1}^{p}|\beta_j|$.
This latter constraint makes the solutions nonlinear in the $y_i$, and there is no closed form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem, although we see in Section 3.4.4 that efficient algorithms are available for computing the entire path of solutions as $\lambda$ is varied, with the same computational cost as for ridge regression. Because of the nature of the constraint, making $t$ sufficiently small will cause some of the coefficients to be exactly zero, as the code sketch below illustrates.
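The book's path algorithm is described in Section 3.4.4; purely to illustrate the sparsity induced by the $L_1$ penalty, here is a minimal sketch of the Lagrangian form (3.52) solved by cyclic coordinate descent with soft thresholding (a standard approach, but not the algorithm used in the text), on hypothetical standardized data; the data, penalty value, and function names are all illustrative. For a moderately large penalty, several coefficients come out exactly zero.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and the response y are centered, so no intercept is needed.
    """
    N, p = X.shape
    beta = np.zeros(p)
    col_ss = (X**2).sum(axis=0)                      # column sums of squares
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            rho = X[:, j] @ r_j
            # Soft-thresholding update: sign(rho) * (|rho| - lam)_+ / sum_i x_ij^2
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

# Toy example: only the first two predictors matter.
rng = np.random.default_rng(3)
N, p = 100, 6
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=N)
y -= y.mean()

print(np.round(lasso_cd(X, y, lam=30.0), 3))   # irrelevant predictors are (typically) exactly 0 here
```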
Thus the lasso does a kind of continuous subset selection. If $t$ is chosen larger than $t_0 = \sum_{1}^{p}|\hat\beta_j|$ (where $\hat\beta_j = \hat\beta_j^{\mathrm{ls}}$, the least squares estimates), then the lasso estimates are the $\hat\beta_j$'s. On the other hand, for $t = t_0/2$, say, the least squares coefficients are shrunk by about 50% on average. However, the nature of the shrinkage is not obvious, and we investigate it further in Section 3.4.4 below. Like the subset size in variable subset selection, or the penalty parameter in ridge regression, $t$ should be adaptively chosen to minimize an estimate of expected prediction error.

In Figure 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter $s = t/\sum_{1}^{p}|\hat\beta_j|$. A value $\hat s \approx 0.36$ was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (fifth column of Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates (last line of Table 3.3) are fairly large.

Figure 3.10 shows the lasso coefficients as the standardized tuning parameter $s = t/\sum_{1}^{p}|\hat\beta_j|$ is varied.
At $s = 1.0$ these are the least squares estimates; they decrease to 0 as $s \to 0$. This decrease is not always strictly monotonic, although it is in this example. A vertical line is drawn at $s = 0.36$, the value chosen by cross-validation.

3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

In this section we discuss and compare the three approaches discussed so far for restricting the linear regression model: subset selection, ridge regression and the lasso.

In the case of an orthonormal input matrix $\mathbf{X}$ the three procedures have explicit solutions.
Each method applies a simple transformation to the least squares estimate $\hat\beta_j$, as detailed in Table 3.4.

Ridge regression does a proportional shrinkage. The lasso translates each coefficient by a constant amount $\lambda$, truncating at zero. This is called "soft thresholding," and is used in the context of wavelet-based smoothing in Section 5.9. Best-subset selection drops all variables with coefficients smaller than the $M$th largest; this is a form of "hard thresholding."

Back to the nonorthogonal case; some pictures help us understand their relationship.
Figure 3.11 depicts the lasso (left) and ridge regression (right) when there are only two parameters. The residual sum of squares has elliptical contours, centered at the full least squares estimate. The constraint region for ridge regression is the disk $\beta_1^2 + \beta_2^2 \le t^2$, while that for the lasso is the diamond $|\beta_1| + |\beta_2| \le t$.

FIGURE 3.10. Profiles of lasso coefficients, as the tuning parameter $t$ is varied. Coefficients are plotted versus $s = t/\sum_{1}^{p}|\hat\beta_j|$. A vertical line is drawn at $s = 0.36$, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piecewise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.

TABLE 3.4. Estimators of $\beta_j$ in the case of orthonormal columns of $\mathbf{X}$. $M$ and $\lambda$ are constants chosen by the corresponding techniques; $\mathrm{sign}$ denotes the sign of its argument ($\pm 1$), and $x_+$ denotes the "positive part" of $x$. Below the table, estimators are shown by broken red lines. The 45° line in gray shows the unrestricted estimate for reference.

Estimator              | Formula
Best subset (size $M$) | $\hat\beta_j \cdot I(|\hat\beta_j| \ge |\hat\beta_{(M)}|)$
Ridge                  | $\hat\beta_j/(1+\lambda)$
Lasso                  | $\mathrm{sign}(\hat\beta_j)\,(|\hat\beta_j| - \lambda)_+$

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively, while the red ellipses are the contours of the least squares error function.
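As a small complement to Table 3.4, here is a minimal sketch of the three orthonormal-case transformations: hard thresholding for best subset, proportional shrinkage for ridge, and soft thresholding for the lasso. The grid of least squares estimates and the values of $M$ and $\lambda$ are illustrative, chosen only to show the qualitative behavior.

```python
import numpy as np

def best_subset(beta_ls, M):
    """Keep the M largest |beta_j| (hard thresholding); set the rest to zero."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    threshold = np.sort(np.abs(beta_ls))[-M]          # |beta_(M)|, the M-th largest magnitude
    return np.where(np.abs(beta_ls) >= threshold, beta_ls, 0.0)

def ridge_orthonormal(beta_ls, lam):
    """Proportional shrinkage: beta_j / (1 + lambda)."""
    return np.asarray(beta_ls, dtype=float) / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    """Soft thresholding: sign(beta_j) * (|beta_j| - lambda)_+."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([2.5, -1.2, 0.4, -0.1])            # illustrative least squares estimates
print(best_subset(beta_ls, M=2))                      # [ 2.5 -1.2  0.   0. ]
print(ridge_orthonormal(beta_ls, lam=1.0))            # [ 1.25 -0.6   0.2  -0.05]
print(lasso_orthonormal(beta_ls, lam=0.5))            # [ 2.  -0.7  0.  -0. ]
```

Note how the three outputs echo the figure beneath Table 3.4: best subset either keeps a coefficient untouched or zeroes it, ridge scales every coefficient by the same factor, and the lasso shifts each one toward zero by $\lambda$, setting the small ones exactly to zero.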