Note that $\mathbf{U}^T\mathbf{y}$ are the coordinates of $\mathbf{y}$ with respect to the orthonormal basis $\mathbf{U}$. Note also the similarity with (3.33); $\mathbf{Q}$ and $\mathbf{U}$ are generally different orthogonal bases for the column space of $\mathbf{X}$ (Exercise 3.8). Now the ridge solutions are
$$
\mathbf{X}\hat\beta^{\mathrm{ridge}}
= \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
= \mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}
= \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^T\mathbf{y}, \qquad (3.47)
$$
where the $u_j$ are the columns of $\mathbf{U}$. Note that since $\lambda \ge 0$, we have $d_j^2/(d_j^2 + \lambda) \le 1$. Like linear regression, ridge regression computes the coordinates of $\mathbf{y}$ with respect to the orthonormal basis $\mathbf{U}$. It then shrinks these coordinates by the factors $d_j^2/(d_j^2 + \lambda)$. This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller $d_j^2$.

What does a small value of $d_j^2$ mean? The SVD of the centered matrix $\mathbf{X}$ is another way of expressing the principal components of the variables in $\mathbf{X}$.
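To make (3.47) concrete, here is a minimal numerical sketch (the data, the array names, and the penalty value `lam` are all illustrative, not from the text) that computes the ridge fit both directly and through the SVD shrinkage factors $d_j^2/(d_j^2+\lambda)$; the two expressions should agree up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                      # center the predictors, as in the text
y = rng.normal(size=N)
lam = 10.0                               # hypothetical value of the penalty lambda

# Direct ridge fit: X (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_direct = X @ beta_ridge

# SVD form (3.47): sum_j u_j * d_j^2 / (d_j^2 + lambda) * u_j^T y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)          # each factor is <= 1 since lambda >= 0
fit_svd = U @ (shrinkage * (U.T @ y))

print(np.allclose(fit_direct, fit_svd))  # True: the two expressions coincide
```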
The sample covariance matrix is given by $\mathbf{S} = \mathbf{X}^T\mathbf{X}/N$, and from (3.45) we have
$$
\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T, \qquad (3.48)
$$
which is the eigen decomposition of $\mathbf{X}^T\mathbf{X}$ (and of $\mathbf{S}$, up to a factor $N$). The eigenvectors $v_j$ (columns of $\mathbf{V}$) are also called the principal component (or Karhunen–Loève) directions of $\mathbf{X}$. The first principal component direction $v_1$ has the property that $z_1 = \mathbf{X}v_1$ has the largest sample variance amongst all normalized linear combinations of the columns of $\mathbf{X}$.
This sample variance is easily seen to be
$$
\mathrm{Var}(z_1) = \mathrm{Var}(\mathbf{X}v_1) = \frac{d_1^2}{N}, \qquad (3.49)
$$
and in fact $z_1 = \mathbf{X}v_1 = u_1 d_1$. The derived variable $z_1$ is called the first principal component of $\mathbf{X}$, and hence $u_1$ is the normalized first principal component. Subsequent principal components $z_j$ have maximum variance $d_j^2/N$, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values $d_j$ correspond to directions in the column space of $\mathbf{X}$ having small variance, and ridge regression shrinks these directions the most.

FIGURE 3.9. Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.
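The following minimal sketch (with toy data of my own; the variable names are illustrative) checks numerically that the SVD of a centered $\mathbf{X}$ recovers the principal component directions, that $z_1 = u_1 d_1$, and that the sample variance of $z_1$ is $d_1^2/N$, as in (3.48)–(3.49).

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3
X = rng.normal(size=(N, p)) @ np.diag([3.0, 1.0, 0.3])  # toy data with unequal spread
X -= X.mean(axis=0)                                      # center, as the text assumes

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                                 # columns v_j: principal component directions

z1 = X @ V[:, 0]                                         # first principal component z_1 = X v_1
print(np.allclose(z1, U[:, 0] * d[0]))                   # True: z_1 = u_1 d_1
print(np.isclose(z1 @ z1 / N, d[0]**2 / N))              # sample variance of z_1 equals d_1^2 / N

# Eigen decomposition of X^T X agrees with (3.48): X^T X = V D^2 V^T
print(np.allclose(X.T @ X, V @ np.diag(d**2) @ V.T))     # True
```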
Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the $Y$-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but it need not hold in general.

In Figure 3.7 we have plotted the estimated prediction error versus the quantity
$$
\mathrm{df}(\lambda)
= \mathrm{tr}[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T]
= \mathrm{tr}(\mathbf{H}_\lambda)
= \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}. \qquad (3.50)
$$
This monotone decreasing function of $\lambda$ is the effective degrees of freedom of the ridge regression fit.
Usually in a linear-regression fit with $p$ variables, the degrees-of-freedom of the fit is $p$, the number of free parameters. The idea is that although all $p$ coefficients in a ridge fit will be non-zero, they are fit in a restricted fashion controlled by $\lambda$. Note that $\mathrm{df}(\lambda) = p$ when $\lambda = 0$ (no regularization) and $\mathrm{df}(\lambda) \to 0$ as $\lambda \to \infty$.
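As a quick numerical check of (3.50) and of these two limits, here is a minimal sketch (again on hypothetical data, with an illustrative grid of penalty values) that computes $\mathrm{df}(\lambda)$ both from the trace definition and from the singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 60, 8
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                          # centered predictors

d = np.linalg.svd(X, compute_uv=False)       # singular values d_j

def df_trace(lam):
    """Effective degrees of freedom via tr[X (X^T X + lam I)^{-1} X^T]."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(H)

def df_svd(lam):
    """Same quantity via sum_j d_j^2 / (d_j^2 + lam), as in (3.50)."""
    return np.sum(d**2 / (d**2 + lam))

for lam in [0.0, 1.0, 10.0, 1e6]:
    print(lam, round(df_trace(lam), 4), round(df_svd(lam), 4))
# At lam = 0 both give p = 8; as lam grows, df(lam) decreases towards 0.
```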
Of course there is always an additional one degree of freedom for the intercept, which was removed a priori. This definition is motivated in more detail in Section 3.4.4 and Sections 7.4–7.6. In Figure 3.7 the minimum occurs at $\mathrm{df}(\lambda) = 5.0$. Table 3.3 shows that ridge regression reduces the test error of the full least squares estimates by a small amount.

3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by
$$
\hat\beta^{\mathrm{lasso}} = \operatorname*{argmin}_{\beta}\;\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
\quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le t. \qquad (3.51)
$$
Just as in ridge regression, we can re-parametrize the constant $\beta_0$ by standardizing the predictors; the solution for $\hat\beta_0$ is $\bar y$, and thereafter we fit a model without an intercept (Exercise 3.5). In the signal processing literature, the lasso is also known as basis pursuit (Chen et al., 1998).

We can also write the lasso problem in the equivalent Lagrangian form
$$
\hat\beta^{\mathrm{lasso}} = \operatorname*{argmin}_{\beta}\;\Bigl\{\tfrac{1}{2}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|\Bigr\}. \qquad (3.52)
$$
Notice the similarity to the ridge regression problem (3.42) or (3.41): the $L_2$ ridge penalty $\sum_{1}^{p}\beta_j^2$ is replaced by the $L_1$ lasso penalty $\sum_{1}^{p}|\beta_j|$.
This latter constraint makes the solutions nonlinear in the $y_i$, and there is no closed form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem, although we see in Section 3.4.4 that efficient algorithms are available for computing the entire path of solutions as $\lambda$ is varied, with the same computational cost as for ridge regression. Because of the nature of the constraint, making $t$ sufficiently small will cause some of the coefficients to be exactly zero, as the code sketch below illustrates.
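The book's path algorithm is described in Section 3.4.4; purely to illustrate the sparsity induced by the $L_1$ penalty, here is a minimal sketch of the Lagrangian form (3.52) solved by cyclic coordinate descent with soft thresholding (a standard approach, but not the algorithm used in the text), on hypothetical standardized data; the data, penalty value, and function names are all illustrative. For a moderately large penalty, several coefficients come out exactly zero.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and the response y are centered, so no intercept is needed.
    """
    N, p = X.shape
    beta = np.zeros(p)
    col_ss = (X**2).sum(axis=0)                      # column sums of squares
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            rho = X[:, j] @ r_j
            # Soft-thresholding update: sign(rho) * (|rho| - lam)_+ / sum_i x_ij^2
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

# Toy example: only the first two predictors matter.
rng = np.random.default_rng(3)
N, p = 100, 6
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=N)
y -= y.mean()

print(np.round(lasso_cd(X, y, lam=30.0), 3))   # irrelevant predictors are (typically) exactly 0 here
```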
Thus the lasso does a kind of continuous subset selection. If $t$ is chosen larger than $t_0 = \sum_{1}^{p}|\hat\beta_j|$ (where $\hat\beta_j = \hat\beta_j^{\mathrm{ls}}$, the least squares estimates), then the lasso estimates are the $\hat\beta_j$'s. On the other hand, for $t = t_0/2$, say, the least squares coefficients are shrunk by about 50% on average. However, the nature of the shrinkage is not obvious, and we investigate it further in Section 3.4.4 below. Like the subset size in variable subset selection, or the penalty parameter in ridge regression, $t$ should be adaptively chosen to minimize an estimate of expected prediction error.

In Figure 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter $s = t/\sum_{1}^{p}|\hat\beta_j|$. A value $\hat s \approx 0.36$ was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (fifth column of Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates (last line of Table 3.3) are fairly large.

Figure 3.10 shows the lasso coefficients as the standardized tuning parameter $s = t/\sum_{1}^{p}|\hat\beta_j|$ is varied.
At $s = 1.0$ these are the least squares estimates; they decrease to 0 as $s \to 0$. This decrease is not always strictly monotonic, although it is in this example. A vertical line is drawn at $s = 0.36$, the value chosen by cross-validation.

3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

In this section we discuss and compare the three approaches discussed so far for restricting the linear regression model: subset selection, ridge regression and the lasso.

In the case of an orthonormal input matrix $\mathbf{X}$ the three procedures have explicit solutions.
Each method applies a simple transformation to the least squares estimate $\hat\beta_j$, as detailed in Table 3.4.

Ridge regression does a proportional shrinkage. The lasso translates each coefficient by a constant amount $\lambda$, truncating at zero. This is called "soft thresholding," and is used in the context of wavelet-based smoothing in Section 5.9. Best-subset selection drops all variables with coefficients smaller than the $M$th largest; this is a form of "hard thresholding."

Back to the nonorthogonal case; some pictures help us understand their relationship.
Figure 3.11 depicts the lasso (left) and ridge regression (right) when there are only two parameters. The residual sum of squares has elliptical contours, centered at the full least squares estimate. The constraint region for ridge regression is the disk $\beta_1^2 + \beta_2^2 \le t^2$, while that for the lasso is the diamond $|\beta_1| + |\beta_2| \le t$.

FIGURE 3.10. Profiles of lasso coefficients, as the tuning parameter $t$ is varied. Coefficients are plotted versus $s = t/\sum_{1}^{p}|\hat\beta_j|$. A vertical line is drawn at $s = 0.36$, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piecewise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.

TABLE 3.4. Estimators of $\beta_j$ in the case of orthonormal columns of $\mathbf{X}$. $M$ and $\lambda$ are constants chosen by the corresponding techniques; $\mathrm{sign}$ denotes the sign of its argument ($\pm 1$), and $x_+$ denotes the "positive part" of $x$. Below the table, estimators are shown by broken red lines. The 45° line in gray shows the unrestricted estimate for reference.

Estimator              | Formula
Best subset (size $M$) | $\hat\beta_j \cdot I(|\hat\beta_j| \ge |\hat\beta_{(M)}|)$
Ridge                  | $\hat\beta_j/(1+\lambda)$
Lasso                  | $\mathrm{sign}(\hat\beta_j)\,(|\hat\beta_j| - \lambda)_+$

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively, while the red ellipses are the contours of the least squares error function.
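As a small complement to Table 3.4, here is a minimal sketch of the three orthonormal-case transformations: hard thresholding for best subset, proportional shrinkage for ridge, and soft thresholding for the lasso. The grid of least squares estimates and the values of $M$ and $\lambda$ are illustrative, chosen only to show the qualitative behavior.

```python
import numpy as np

def best_subset(beta_ls, M):
    """Keep the M largest |beta_j| (hard thresholding); set the rest to zero."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    threshold = np.sort(np.abs(beta_ls))[-M]          # |beta_(M)|, the M-th largest magnitude
    return np.where(np.abs(beta_ls) >= threshold, beta_ls, 0.0)

def ridge_orthonormal(beta_ls, lam):
    """Proportional shrinkage: beta_j / (1 + lambda)."""
    return np.asarray(beta_ls, dtype=float) / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    """Soft thresholding: sign(beta_j) * (|beta_j| - lambda)_+."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([2.5, -1.2, 0.4, -0.1])            # illustrative least squares estimates
print(best_subset(beta_ls, M=2))                      # [ 2.5 -1.2  0.   0. ]
print(ridge_orthonormal(beta_ls, lam=1.0))            # [ 1.25 -0.6   0.2  -0.05]
print(lasso_orthonormal(beta_ls, lam=0.5))            # [ 2.  -0.7  0.  -0. ]
```

Note how the three outputs echo the figure beneath Table 3.4: best subset either keeps a coefficient untouched or zeroes it, ridge scales every coefficient by the same factor, and the lasso shifts each one toward zero by $\lambda$, setting the small ones exactly to zero.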