The Elements of Statistical Learning: Data Mining, Inference, and Prediction (811377), page 22

File No. 811377 (The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction.pdf), page 22. StudIzba, 2020-08-25.


Text from the file (page 22)

The lasso path can have more than $p$ steps, although the two are often quite similar. Algorithm 3.2 with the lasso modification 3.2a is an efficient way of computing the solution to any lasso problem, especially when $p \gg N$. Osborne et al. (2000a) also discovered a piecewise-linear path for computing the lasso, which they called a homotopy algorithm.

We now give a heuristic argument for why these procedures are so similar. Although the LAR algorithm is stated in terms of correlations, if the input features are standardized, it is equivalent and easier to work with inner products.

Suppose $A$ is the active set of variables at some stage in the algorithm, tied in their absolute inner product with the current residuals $y - X\beta$. We can express this as

$$x_j^T (y - X\beta) = \gamma \cdot s_j, \quad \forall j \in A \tag{3.56}$$

where $s_j \in \{-1, 1\}$ indicates the sign of the inner product, and $\gamma$ is the common value. Also $|x_k^T (y - X\beta)| \le \gamma$ for all $k \notin A$.

Now consider the lasso criterion (3.52), which we write in vector form

$$R(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1. \tag{3.57}$$

Let $B$ be the active set of variables in the solution for a given value of $\lambda$. For these variables $R(\beta)$ is differentiable, and the stationarity conditions give

$$x_j^T (y - X\beta) = \lambda \cdot \mathrm{sign}(\beta_j), \quad \forall j \in B. \tag{3.58}$$

Comparing (3.58) with (3.56), we see that they are identical only if the sign of $\beta_j$ matches the sign of the inner product. That is why the LAR algorithm and lasso start to differ when an active coefficient passes through zero; condition (3.58) is violated for that variable, and it is kicked out of the active set $B$.
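The stationarity condition (3.58), together with the matching inequality $|x_k^T(y - X\beta)| \le \lambda$ for variables outside the active set, can be checked numerically. The sketch below is only an illustration, not the LAR algorithm of the text: it minimizes (3.57) with a plain cyclic coordinate descent solver (a swapped-in technique), and the data, the value `lam = 20.0`, and the helper names `soft_threshold` and `lasso_cd` are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=2000):
    """Minimize 0.5*||y - X b||_2^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # <x_j, x_j> for each column
    r = y.copy()                       # current residual y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]     # partial residual without variable j
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * beta[j]     # restore the full residual
    return beta

# Hypothetical standardized data with a sparse true coefficient vector.
rng = np.random.default_rng(0)
N, p = 50, 8
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(N)
y = y - y.mean()

lam = 20.0
beta_hat = lasso_cd(X, y, lam)
inner = X.T @ (y - X @ beta_hat)       # x_j^T (y - X beta) for every j
active = beta_hat != 0                 # the active set B

# Active variables: inner product equals lam * sign(beta_j), as in (3.58).
print(np.allclose(inner[active], lam * np.sign(beta_hat[active]), atol=1e-6))
# Inactive variables: absolute inner product is at most lam.
print(np.all(np.abs(inner[~active]) <= lam + 1e-6))
```

Both checks hold at a converged solution regardless of which variables end up active, which is why the example does not state the active set itself.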

Exercise 3.23 shows that these equations imply a piecewise-linear coefficient profile as $\lambda$ decreases. The stationarity conditions for the non-active variables require that

$$|x_k^T (y - X\beta)| \le \lambda, \quad \forall k \notin B, \tag{3.59}$$

which again agrees with the LAR algorithm.

Figure 3.16 compares LAR and lasso to forward stepwise and stagewise regression.

The setup is the same as in Figure 3.6 on page 59, except that here $N = 100$ rather than 300, so the problem is more difficult. We see that the more aggressive forward stepwise starts to overfit quite early (well before the 10 true variables can enter the model), and ultimately performs worse than the slower forward stagewise regression. The behavior of LAR and lasso is similar to that of forward stagewise regression.

Incremental forward stagewise is similar to LAR and lasso, and is described in Section 3.8.1.

Degrees-of-Freedom Formula for LAR and Lasso

Suppose that we fit a linear model via the least angle regression procedure, stopping at some number of steps $k < p$, or equivalently using a lasso bound $t$ that produces a constrained version of the full least squares fit. How many parameters, or “degrees of freedom,” have we used?

Consider first a linear regression using a subset of $k$ features.

If this subset is prespecified in advance without reference to the training data, then the degrees of freedom used in the fitted model is defined to be $k$. Indeed, in classical statistics, the number of linearly independent parameters is what is meant by “degrees of freedom.” Alternatively, suppose that we carry out a best subset selection to determine the “optimal” set of $k$ predictors.

Then the resulting model has $k$ parameters, but in some sense we have used up more than $k$ degrees of freedom.

We need a more general definition for the effective degrees of freedom of an adaptively fitted model. We define the degrees of freedom of the fitted vector $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N)$ as

$$\mathrm{df}(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i). \tag{3.60}$$

Here $\mathrm{Cov}(\hat{y}_i, y_i)$ refers to the sampling covariance between the predicted value $\hat{y}_i$ and its corresponding outcome value $y_i$. This makes intuitive sense: the harder that we fit to the data, the larger this covariance and hence $\mathrm{df}(\hat{y})$. Expression (3.60) is a useful notion of degrees of freedom, one that can be applied to any model prediction $\hat{y}$.
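Definition (3.60) can be estimated by Monte Carlo: hold the inputs fixed, redraw the noise many times, and average $\hat{y}^T(y - \mu)/\sigma^2$, whose expectation is the sum of covariances in (3.60). The following is a rough sketch under assumed settings (a pure-noise response with $\mu = 0$, $N = 30$, $p = 6$, $k = 2$, 2000 replicates; all names are hypothetical). It contrasts a prespecified subset of $k$ predictors with best subset selection of size $k$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, p, k, sigma = 30, 6, 2, 1.0
n_reps = 2000

X = rng.standard_normal((N, p))        # inputs held fixed across replicates
mu = np.zeros(N)                       # assumed true mean (pure noise)

def hat_matrix(cols):
    """Projection onto the span of the chosen columns of X."""
    Xs = X[:, list(cols)]
    return Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)

subsets = list(itertools.combinations(range(p), k))
hats = np.stack([hat_matrix(s) for s in subsets])    # (n_subsets, N, N)
H_fixed = hats[0]                                    # prespecified subset (0, 1)

sum_fixed = 0.0
sum_best = 0.0
for _ in range(n_reps):
    eps = sigma * rng.standard_normal(N)
    y = mu + eps
    fits = hats @ y                                  # fit for every subset
    rss = ((y - fits) ** 2).sum(axis=1)
    y_hat_best = fits[np.argmin(rss)]                # best subset chosen using y
    y_hat_fixed = H_fixed @ y                        # subset chosen without y
    sum_fixed += y_hat_fixed @ eps                   # accumulates y_hat^T (y - mu)
    sum_best += y_hat_best @ eps

df_fixed = sum_fixed / (n_reps * sigma**2)
df_best = sum_best / (n_reps * sigma**2)
print(df_fixed, df_best)
```

Under these assumptions the fixed-subset estimate should come out close to $k$, while the best-subset estimate should be clearly larger, reflecting the extra fitting done by the search over subsets.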

This includes models that are adaptively fitted to the training data. This definition is motivated and discussed further in Sections 7.4–7.6.

[Figure 3.16: $E\|\hat{\beta}(k) - \beta\|^2$ plotted against the fraction of $L_1$ arc-length, with curves for Forward Stepwise, LAR, Lasso, Forward Stagewise, and Incremental Forward Stagewise.]

FIGURE 3.16. Comparison of LAR and lasso with forward stepwise, forward stagewise (FS) and incremental forward stagewise (FS$_0$) regression. The setup is the same as in Figure 3.6, except $N = 100$ here rather than 300. Here the slower FS regression ultimately outperforms forward stepwise. LAR and lasso show similar behavior to FS and FS$_0$. Since the procedures take different numbers of steps (across simulation replicates and methods), we plot the MSE as a function of the fraction of total $L_1$ arc-length toward the least-squares fit.

Now for a linear regression with $k$ fixed predictors, it is easy to show that $\mathrm{df}(\hat{y}) = k$. Likewise for ridge regression, this definition leads to the closed-form expression (3.50) on page 68: $\mathrm{df}(\hat{y}) = \mathrm{tr}(S_\lambda)$. In both these cases, (3.60) is simple to evaluate because the fit $\hat{y} = H_\lambda y$ is linear in $y$. If we think about definition (3.60) in the context of a best subset selection of size $k$, it seems clear that $\mathrm{df}(\hat{y})$ will be larger than $k$, and this can be verified by estimating $\mathrm{Cov}(\hat{y}_i, y_i)/\sigma^2$ directly by simulation. However, there is no closed-form method for estimating $\mathrm{df}(\hat{y})$ for best subset selection.

For LAR and lasso, something magical happens.

These techniques are adaptive in a smoother way than best subset selection, and hence estimation of degrees of freedom is more tractable. Specifically it can be shown that after the $k$th step of the LAR procedure, the effective degrees of freedom of the fit vector is exactly $k$. Now for the lasso, the (modified) LAR procedure often takes more than $p$ steps, since predictors can drop out.

Hence the definition is a little different; for the lasso, at any stage $\mathrm{df}(\hat{y})$ approximately equals the number of predictors in the model. While this approximation works reasonably well anywhere in the lasso path, for each $k$ it works best at the last model in the sequence that contains $k$ predictors. A detailed study of the degrees of freedom for the lasso may be found in Zou et al. (2007).

3.5 Methods Using Derived Input Directions

In many situations we have a large number of inputs, often very correlated. The methods in this section produce a small number of linear combinations $Z_m$, $m = 1, \ldots, M$ of the original inputs $X_j$, and the $Z_m$ are then used in place of the $X_j$ as inputs in the regression. The methods differ in how the linear combinations are constructed.

3.5.1 Principal Components Regression

In this approach the linear combinations $Z_m$ used are the principal components as defined in Section 3.4.1 above.

Principal component regression forms the derived input columns $z_m = X v_m$, and then regresses $y$ on $z_1, z_2, \ldots, z_M$ for some $M \le p$. Since the $z_m$ are orthogonal, this regression is just a sum of univariate regressions:

$$\hat{y}^{\mathrm{pcr}}_{(M)} = \bar{y}\mathbf{1} + \sum_{m=1}^{M} \hat{\theta}_m z_m, \tag{3.61}$$

where $\hat{\theta}_m = \langle z_m, y \rangle / \langle z_m, z_m \rangle$. Since the $z_m$ are each linear combinations of the original $x_j$, we can express the solution (3.61) in terms of coefficients of the $x_j$ (Exercise 3.13):

$$\hat{\beta}^{\mathrm{pcr}}(M) = \sum_{m=1}^{M} \hat{\theta}_m v_m. \tag{3.62}$$

As with ridge regression, principal components depend on the scaling of the inputs, so typically we first standardize them. Note that if $M = p$, we would just get back the usual least squares estimates, since the columns of $Z = UD$ span the column space of $X$.
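Equations (3.61) and (3.62) translate almost directly into code. The sketch below is one possible NumPy rendering, assuming the columns of `X` have already been standardized and taking the directions $v_m$ from the singular value decomposition of `X`; the function name `pcr_fit` and the simulated data are assumptions for illustration only.

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression with M components.

    Assumes the columns of X are centered (ideally standardized).
    Returns the coefficients on the original inputs and the fitted values.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T
    Z = X @ V[:, :M]                             # derived inputs z_m = X v_m
    theta = (Z.T @ y) / (Z ** 2).sum(axis=0)     # <z_m, y> / <z_m, z_m>
    beta = V[:, :M] @ theta                      # coefficients on x_j, as in (3.62)
    y_hat = y.mean() + Z @ theta                 # fitted values, as in (3.61)
    return beta, y_hat

# Hypothetical correlated inputs, standardized column by column.
rng = np.random.default_rng(2)
N, p = 40, 5
X = rng.standard_normal((N, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + rng.standard_normal(N)

beta_3, y_hat_3 = pcr_fit(X, y, M=3)             # reduced regression
beta_full, y_hat_full = pcr_fit(X, y, M=p)       # all p components

# With M = p the coefficients should match ordinary least squares.
beta_ols = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
print(np.allclose(beta_full, beta_ols))
```

Because the $z_m$ are orthogonal, each $\hat{\theta}_m$ is computed independently of the others, so increasing $M$ only appends terms and never changes the earlier ones.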

For $M < p$ we get a reduced regression. We see that principal components regression is very similar to ridge regression: both operate via the principal components of the input matrix. Ridge regression shrinks the coefficients of the principal components (Figure 3.17), shrinking more depending on the size of the corresponding eigenvalue; principal components regression discards the $p - M$ smallest eigenvalue components. Figure 3.17 illustrates this.

[Figure 3.17: shrinkage factor (vertical axis, 0.0 to 1.0) plotted against principal component index (horizontal axis), with one series for ridge and one for pcr.]

FIGURE 3.17. Ridge regression shrinks the regression coefficients of the principal components, using shrinkage factors $d_j^2/(d_j^2 + \lambda)$ as in (3.47). Principal component regression truncates them. Shown are the shrinkage and truncation patterns corresponding to Figure 3.7, as a function of the principal component index.

In Figure 3.7 we see that cross-validation suggests seven terms; the resulting model has the lowest test error in Table 3.3.

3.5.2 Partial Least Squares

This technique also constructs a set of linear combinations of the inputs for regression, but unlike principal components regression it uses $y$ (in addition to $X$) for this construction.
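The contrast between shrinking and discarding can be made concrete by writing both fits in the basis of the left singular vectors of $X$: ridge multiplies the $j$th coordinate of $y$ by $d_j^2/(d_j^2 + \lambda)$, while principal components regression multiplies it by 1 or 0. The sketch below assumes centered simulated data and arbitrary values $\lambda = 4$ and $M = 3$; the helper names are hypothetical.

```python
import numpy as np

def ridge_shrinkage(d, lam):
    """Ridge shrinkage factors d_j^2 / (d_j^2 + lam)."""
    return d**2 / (d**2 + lam)

def pcr_truncation(d, M):
    """PCR factors: 1 for the first M components, 0 for the rest."""
    return (np.arange(len(d)) < M).astype(float)

rng = np.random.default_rng(3)
N, p, lam, M = 40, 5, 4.0, 3
X = rng.standard_normal((N, p))
X = X - X.mean(axis=0)                 # centered inputs
y = rng.standard_normal(N)
y = y - y.mean()                       # centered response, so no intercept is needed

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # d is sorted in decreasing order
ridge_factors = ridge_shrinkage(d, lam)
pcr_factors = pcr_truncation(d, M)

coords = U.T @ y                                   # coordinates of y along each u_j
fit_ridge_svd = U @ (ridge_factors * coords)       # shrink every coordinate
fit_pcr_svd = U @ (pcr_factors * coords)           # keep the first M, drop the rest

# Cross-check against the usual ridge solution (X^T X + lam I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_ridge_direct = X @ beta_ridge
print(np.allclose(fit_ridge_svd, fit_ridge_direct))
print(ridge_factors, pcr_factors)
```

Since the singular values are sorted in decreasing order, the ridge factors decrease smoothly with the component index, whereas the truncation pattern drops from 1 to 0 after component $M$.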

Like principal component regression, partial least squares (PLS) is not scale invariant, so we assume that each $x_j$ is standardized to have mean 0 and variance 1. PLS begins by computing $\hat{\varphi}_{1j} = \langle x_j, y \rangle$ for each $j$. From this we construct the derived input $z_1 = \sum_j \hat{\varphi}_{1j} x_j$, which is the first partial least squares direction. Hence in the construction of each $z_m$, the inputs are weighted by the strength of their univariate effect on $y$. The outcome $y$ is regressed on $z_1$ giving coefficient $\hat{\theta}_1$, and then we orthogonalize $x_1, \ldots, x_p$ with respect to $z_1$. We continue this process until $M \le p$ directions have been obtained.
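The steps just described can be sketched as a short loop: weight the current inputs by their inner products with $y$, regress $y$ on the resulting direction, then orthogonalize the inputs against it. This is a minimal illustration under assumed conditions (standardized simulated inputs with full column rank); the function name `pls_fit` and the data are hypothetical.

```python
import numpy as np

def pls_fit(X, y, M):
    """Partial least squares with M directions.

    Assumes the columns of X are standardized.
    Returns the fitted values and the matrix of directions z_1, ..., z_M.
    """
    Xm = X.copy()                                # inputs, orthogonalized as we go
    y_hat = np.full(len(y), y.mean())            # start from the mean of y
    directions = []
    for _ in range(M):
        phi = Xm.T @ y                           # phi_mj = <x_j, y>
        z = Xm @ phi                             # z_m = sum_j phi_mj x_j
        zz = z @ z
        theta = (z @ y) / zz                     # regress y on z_m
        y_hat = y_hat + theta * z
        Xm = Xm - np.outer(z, z @ Xm) / zz       # orthogonalize each x_j w.r.t. z_m
        directions.append(z)
    return y_hat, np.column_stack(directions)

rng = np.random.default_rng(4)
N, p = 40, 4
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.standard_normal(N)

y_hat_2, Z_2 = pls_fit(X, y, M=2)                # reduced fit with two directions
y_hat_full, Z_full = pls_fit(X, y, M=p)          # all p directions

# The directions should be mutually orthogonal.
Zn = Z_full / np.linalg.norm(Z_full, axis=0)
print(np.allclose(Zn.T @ Zn, np.eye(p), atol=1e-6))
```

With all $M = p$ directions and inputs of full column rank, the fitted values should coincide with the least squares fit on the same data, which gives a simple sanity check for an implementation like this one.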

