Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.

• 6 quantitative predictors—the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.

• The average length of uninterrupted sequences of capital letters: CAPAVE.

• The length of the longest uninterrupted sequence of capital letters: CAPMAX.

• The sum of the length of uninterrupted sequences of capital letters: CAPTOT.

TABLE 9.1. Test data confusion matrix for the additive logistic regression model fit to the spam training data. The overall test error rate is 5.5%.

                        Predicted Class
True Class         email (0)    spam (1)
email (0)            58.3%        2.5%
spam (1)              3.0%       36.3%

We coded spam as 1 and email as zero. A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set. A generalized additive model was fit, using a cubic smoothing spline with a nominal four degrees of freedom for each predictor.
What this means is that for each predictor Xj, the smoothing-spline parameter λj was chosen so that trace[Sj(λj)] − 1 = 4, where Sj(λ) is the smoothing spline operator matrix constructed using the observed values xij, i = 1, . . . , N. This is a convenient way of specifying the amount of smoothing in such a complex model.

Most of the spam predictors have a very long-tailed distribution. Before fitting the GAM model, we log-transformed each variable (actually log(x + 0.1)), but the plots in Figure 9.1 are shown as a function of the original variables.
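As an aside (not from the book), the degrees-of-freedom calibration just described can be sketched in a few lines: for a single predictor, bisect on the penalty λ until the trace of the smoother matrix S(λ) hits the target value. The basis and penalty below (a cubic regression spline with a handful of knots and a ridge penalty on the knot terms) are illustrative stand-ins for the smoothing spline used in the text.

```python
# Minimal sketch of choosing lambda so that trace[S(lambda)] - 1 = 4; the
# penalized regression-spline smoother here is an assumed stand-in for the
# smoothing spline of the text.
import numpy as np

def truncated_power_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, and (x - k)_+^3 per knot."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def effective_df(x, lam, knots):
    """trace of S(lam) = B (B'B + lam * P)^(-1) B', with a ridge penalty P on
    the knot coefficients only (the cubic polynomial part is unpenalized)."""
    B = truncated_power_basis(x, knots)
    P = np.zeros((B.shape[1], B.shape[1]))
    P[4:, 4:] = np.eye(len(knots))
    S = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    return np.trace(S)

def lambda_for_trace(x, target_trace, knots, lo=1e-8, hi=1e8):
    """Bisect on lambda; the effective df is monotone decreasing in lambda."""
    for _ in range(100):
        mid = np.sqrt(lo * hi)
        if effective_df(x, mid, knots) > target_trace:
            lo = mid          # fit too flexible: increase the penalty
        else:
            hi = mid          # fit too smooth: decrease the penalty
    return mid

rng = np.random.default_rng(0)
x = np.log(rng.exponential(size=300) + 0.1)               # long-tailed predictor, log(x + 0.1)
knots = np.quantile(x, np.linspace(0.1, 0.9, 10))
lam = lambda_for_trace(x, target_trace=5.0, knots=knots)  # trace = 5, i.e. trace - 1 = 4
print(lam, effective_df(x, lam, knots))
```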
The test error rates are shown in Table 9.1; the overall error rate is 5.3%. By comparison, a linear logistic regression has a test error rate of 7.6%. Table 9.2 shows the predictors that are highly significant in the additive model.

For ease of interpretation, in Table 9.2 the contribution for each variable is decomposed into a linear component and the remaining nonlinear component. The top block of predictors are positively correlated with spam, while the bottom block is negatively correlated. The linear component is a weighted least squares linear fit of the fitted curve on the predictor, while the nonlinear part is the residual. The linear component of an estimated function is summarized by the coefficient, standard error and Z-score; the latter is the coefficient divided by its standard error, and is considered significant if it exceeds the appropriate quantile of a standard normal distribution. The column labeled nonlinear P-value is a test of nonlinearity of the estimated function. Note, however, that the effect of each predictor is fully adjusted for the entire effects of the other predictors, not just for their linear parts. The predictors shown in the table were judged significant by at least one of the tests (linear or nonlinear) at the p = 0.01 level (two-sided).

TABLE 9.2. Significant predictors from the additive model fit to the spam training data. The coefficients represent the linear part of fˆj, along with their standard errors and Z-score. The nonlinear P-value is for a test of nonlinearity of fˆj.

Positive effects
Name        Num.   df   Coefficient   Std. Error   Z Score   Nonlinear P-value
our            5   3.9        0.566        0.114     4.970               0.052
over           6   3.9        0.244        0.195     1.249               0.004
remove         7   4.0        0.949        0.183     5.201               0.093
internet       8   4.0        0.524        0.176     2.974               0.028
free          16   3.9        0.507        0.127     4.010               0.065
business      17   3.8        0.779        0.186     4.179               0.194
hpl           26   3.8        0.045        0.250     0.181               0.002
ch!           52   4.0        0.674        0.128     5.283               0.164
ch$           53   3.9        1.419        0.280     5.062               0.354
CAPMAX        56   3.8        0.247        0.228     1.080               0.000
CAPTOT        57   4.0        0.755        0.165     4.566               0.063

Negative effects
hp            25   3.9       −1.404        0.224    −6.262               0.140
george        27   3.7       −5.003        0.744    −6.722               0.045
1999          37   3.8       −0.672        0.191    −3.512               0.011
re            45   3.9       −0.620        0.133    −4.649               0.597
edu           46   4.0       −1.183        0.209    −5.647               0.000
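The linear/nonlinear decomposition reported in Table 9.2 is itself easy to compute once a curve has been fit: the linear component is the weighted least squares regression of the fitted curve on the predictor, and the nonlinear part is the residual. A small sketch with placeholder data and weights (not the book's fitted functions):

```python
# Sketch of the Table 9.2 decomposition; f_hat and the weights w are synthetic
# placeholders for a fitted curve f_j and the weights from the GAM fit.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.exponential(size=200))            # a long-tailed predictor
f_hat = np.log1p(x) - np.log1p(x).mean()          # stand-in for a fitted f_j
w = np.ones_like(x)                               # e.g., final IRLS weights

# Weighted least squares of f_hat on (1, x): solve (B'WB) beta = B'W f_hat.
B = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(B.T @ (w[:, None] * B), B.T @ (w * f_hat))
linear_part = B @ beta                            # the "linear component"
nonlinear_part = f_hat - linear_part              # the residual, i.e. the nonlinear part
print("linear coefficient:", beta[1])
```

The Z-score in the table additionally requires the standard error of this coefficient, which comes from the fitted additive model and is not reproduced here.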
Figure 9.1 shows the estimated functions for the significant predictors appearing in Table 9.2. Many of the nonlinear effects appear to account for a strong discontinuity at zero. For example, the probability of spam drops significantly as the frequency of george increases from zero, but then does not change much after that. This suggests that one might replace each of the frequency predictors by an indicator variable for a zero count, and resort to a linear logistic model. This gave a test error rate of 7.4%; including the linear effects of the frequencies as well dropped the test error to 6.6%. It appears that the nonlinearities in the additive model have additional predictive power.

FIGURE 9.1. Spam analysis: estimated functions for significant predictors. The rug plot along the bottom of each frame indicates the observed values of the corresponding predictor. For many of the predictors the nonlinearity picks up the discontinuity at zero.

It is more serious to classify a genuine email message as spam, since then a good email would be filtered out and would not reach the user. We can alter the balance between the class error rates by changing the losses (see Section 2.4). If we assign a loss L01 for predicting a true class 0 as class 1, and L10 for predicting a true class 1 as class 0, then the estimated Bayes rule predicts class 1 if its probability is greater than L01/(L01 + L10).
For example, if we take L01 = 10, L10 = 1 then the (true) class 0 and class 1 error rates change to 0.8% and 8.7%.

More ambitiously, we can encourage the model to fit the data better in class 0 by using weights L01 for the class 0 observations and L10 for the class 1 observations. As above, we then use the estimated Bayes rule to predict. This gave error rates of 1.2% and 8.0% in (true) class 0 and class 1, respectively.
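A minimal sketch of both devices, with an ordinary linear logistic regression on synthetic data standing in for the fitted additive model (so the error rates will not match the numbers quoted above):

```python
# (i) threshold the fitted probabilities at L01/(L01 + L10);
# (ii) refit with observation weights L01 (class 0) and L10 (class 1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (class 1 = "spam", class 0 = "email").
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.6, 0.4],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def per_class_error(y_true, y_pred):
    """Error rate within true class 0 and within true class 1."""
    return tuple(float(np.mean(y_pred[y_true == k] != k)) for k in (0, 1))

L01, L10 = 10.0, 1.0                  # misclassifying email as spam is 10x worse
threshold = L01 / (L01 + L10)         # estimated Bayes rule: predict 1 if p > 10/11

# (i) Change the threshold on the fitted probabilities.
fit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_hat = (fit.predict_proba(X_test)[:, 1] > threshold).astype(int)
print("thresholded:", per_class_error(y_test, y_hat))

# (ii) Weighted fit, then the same thresholded rule.
w = np.where(y_train == 0, L01, L10)
fit_w = LogisticRegression(max_iter=1000).fit(X_train, y_train, sample_weight=w)
y_hat_w = (fit_w.predict_proba(X_test)[:, 1] > threshold).astype(int)
print("weighted fit:", per_class_error(y_test, y_hat_w))
```

In scikit-learn the weighted fit can equivalently be requested through the class_weight argument of LogisticRegression.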
We discuss the issue of unequal losses further below, in the context of tree-based models.

After fitting an additive model, one should check whether the inclusion of some interactions can significantly improve the fit. This can be done “manually,” by inserting products of some or all of the significant inputs, or automatically via the MARS procedure (Section 9.4).
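A rough sketch of the manual check, again with synthetic data and a plain logistic regression standing in for the additive fit; the indices of the "significant" inputs are an arbitrary assumption:

```python
# Add pairwise products of a few selected inputs and compare held-out error.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

significant = [0, 2, 5]                     # assumed indices of significant inputs

def with_products(X, idx):
    """Augment the design matrix with pairwise products of the chosen inputs."""
    prods = [X[:, i] * X[:, j] for i, j in combinations(idx, 2)]
    return np.column_stack([X] + prods)

base = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
inter = LogisticRegression(max_iter=2000).fit(with_products(X_tr, significant), y_tr)
print("test error, main effects only:", 1 - base.score(X_te, y_te))
print("test error, with interactions:", 1 - inter.score(with_products(X_te, significant), y_te))
```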
This example uses the additive model in an automatic fashion. As a data analysis tool, additive models are often used in a more interactive fashion, adding and dropping terms to determine their effect. By calibrating the amount of smoothing in terms of dfj, one can move seamlessly between linear models (dfj = 1) and partially linear models, where some terms are modeled more flexibly. See Hastie and Tibshirani (1990) for more details.

9.1.3 Summary

Additive models provide a useful extension of linear models, making them more flexible while still retaining much of their interpretability.
The familiar tools for modeling and inference in linear models are also available for additive models, seen for example in Table 9.2. The backfitting procedure for fitting these models is simple and modular, allowing one to choose a fitting method appropriate for each input variable. As a result they have become widely used in the statistical community.
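For concreteness, here is a minimal sketch of the backfitting cycle for an additive regression fit under squared-error loss, with a simple nearest-neighbor running-mean smoother standing in for the smoothing splines of the text; the data and the smoother are illustrative assumptions.

```python
# Backfitting sketch: cycle over coordinates, smoothing partial residuals.
import numpy as np

def knn_smooth(x, r, k=30):
    """Smooth residuals r against x by averaging ~k nearest neighbors in x."""
    order = np.argsort(x)
    r_sorted = r[order]
    fitted_sorted = np.array([r_sorted[max(0, i - k // 2): i + k // 2 + 1].mean()
                              for i in range(len(x))])
    fitted = np.empty_like(fitted_sorted)
    fitted[order] = fitted_sorted
    return fitted

def backfit(X, y, n_iter=20):
    """Return the intercept and an n x p matrix F whose column j holds the
    fitted function f_j evaluated at the observed x_ij."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - F.sum(axis=1) + F[:, j]   # leave f_j out
            F[:, j] = knn_smooth(X[:, j], partial)
            F[:, j] -= F[:, j].mean()                       # center each f_j
    return alpha, F

# Toy usage: an additive surface with two nonlinear components.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=500)
alpha, F = backfit(X, y)
```

The modularity mentioned above is visible here: any smoother with the same interface could replace knn_smooth for a particular coordinate.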
However, additive models can have limitations for large data-mining applications. The backfitting algorithm fits all predictors, which is not feasible or desirable when a large number are available. The BRUTO procedure (Hastie and Tibshirani, 1990, Chapter 9) combines backfitting with selection of inputs, but is not designed for large data-mining problems. There has also been recent work using lasso-type penalties to estimate sparse additive models, for example the COSSO procedure of Lin and Zhang (2006) and the SpAM proposal of Ravikumar et al. (2008). For large problems a forward stagewise approach such as boosting (Chapter 10) is more effective, and also allows for interactions to be included in the model.

9.2 Tree-Based Methods

9.2.1 Background

Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (like a constant) in each one. They are conceptually simple yet powerful.
We first describe a popular method for tree-based regression and classification called CART, and later contrast it with C4.5, a major competitor.

Let's consider a regression problem with continuous response Y and inputs X1 and X2, each taking values in the unit interval. The top left panel of Figure 9.2 shows a partition of the feature space by lines that are parallel to the coordinate axes. In each partition element we can model Y with a different constant. However, there is a problem: although each partitioning line has a simple description like X1 = c, some of the resulting regions are complicated to describe.

To simplify matters, we restrict attention to recursive binary partitions like that in the top right panel of Figure 9.2. We first split the space into two regions, and model the response by the mean of Y in each region. We choose the variable and split-point to achieve the best fit.
Then one or both of these regions are split into two more regions, and this process is continued, until some stopping rule is applied. For example, in the top right panel of Figure 9.2, we first split at X1 = t1. Then the region X1 ≤ t1 is split at X2 = t2 and the region X1 > t1 is split at X1 = t3. Finally, the region X1 > t3 is split at X2 = t4. The result of this process is a partition into the five regions R1, R2, . . . , R5 shown in the figure. The corresponding regression model predicts Y with a constant cm in region Rm, that is,

fˆ(X) = Σ_{m=1}^{5} cm I{(X1, X2) ∈ Rm}.    (9.9)

This same model can be represented by the binary tree in the bottom left panel of Figure 9.2.
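A small numpy sketch of both ideas: evaluating the piecewise-constant model (9.9) for the partition of Figure 9.2 under assumed split points, and choosing a single splitting variable and split-point by minimizing the within-region sum of squares (the names and data are illustrative, not CART's implementation):

```python
# (i) the piecewise-constant model (9.9) over R1,...,R5 of Figure 9.2;
# (ii) greedy choice of a single variable and split-point by squared error.
import numpy as np

def predict_tree(x1, x2, t, c):
    """Evaluate (9.9); t = (t1, t2, t3, t4) are assumed split points and
    c = (c1, ..., c5) the region constants (e.g., region means)."""
    t1, t2, t3, t4 = t
    if x1 <= t1:
        return c[0] if x2 <= t2 else c[1]          # R1, R2
    if x1 <= t3:
        return c[2]                                # R3
    return c[3] if x2 <= t4 else c[4]              # R4, R5

def best_split(X, y):
    """Pick the variable j and split-point s minimizing the total squared
    error around the two region means."""
    best = (np.inf, None, None)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, s)
    return best   # (sse, variable index, split point)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                     # X1, X2 in the unit interval
y = np.where(X[:, 0] <= 0.4, 1.0, 3.0) + rng.normal(scale=0.3, size=200)
print(best_split(X, y))                            # should recover variable 0, split near 0.4
```

Growing a full tree amounts to applying best_split recursively within each resulting region until a stopping rule is met.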