The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction (811377), страница 56

Файл №811377 The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction (The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction.pdf) 56 страницаThe Elements of Statistical Learning. Data Mining_ Inference_ and Prediction (811377) страница 562020-08-252020-08-25СтудИзба

The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction.pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 56)

In general, a linear indicator function in p dimensions has VCdimension p + 1, which is also the number of free parameters. On the otherhand, it can be shown that the family sin(αx) has infinite VC dimension,as Figure 7.5 suggests. By appropriate choice of α, any set of points can beshattered by this class (Exercise 7.8).So far we have discussed the VC dimension only of indicator functions,but this can be extended to real-valued functions. The VC dimension of aclass of real-valued functions {g(x, α)} is defined to be the VC dimensionof the indicator class {I(g(x, α) − β > 0)}, where β takes values over therange of g.One can use the VC dimension in constructing an estimate of (extrasample) prediction error; different types of results are available.

Using theconcept of VC dimension, one can prove results about the optimism of thetraining error when using a class of functions. An example of such a result isthe following. If we fit N training points using a class of functions {f (x, α)}having VC dimension h, then with probability at least 1 − η over training7.9 Vapnik–Chervonenkis Dimensionsets:ErrT≤ErrT≤239r4 · err ǫ1+ 1+(binary classification)err +2ǫerr√(regression)(7.46)(1 − c ǫ)+h[log (a2 N/h) + 1] − log (η/4),where ǫ = a1Nand 0 < a1 ≤ 4, 0 < a2 ≤ 2These bounds hold simultaneously for all members f (x, α), and are takenfrom Cherkassky and Mulier (2007, pages 116–118).

They recommend thevalue c = 1. For regression they suggest a1 = a2 = 1, and for classificationthey make no recommendation, with a1 = 4 and a2 = 2 correspondingto worst-case scenarios. They also give an alternative practical bound forregression!−1rlog N(7.47)ErrT ≤ err 1 − ρ − ρ log ρ +2N+hwith ρ = N, which is free of tuning constants. The bounds suggest that theoptimism increases with h and decreases with N in qualitative agreementwith the AIC correction d/N given in (7.24).

However, the results in (7.46)are stronger: rather than giving the expected optimism for each fixed function f (x, α), they give probabilistic upper bounds for all functions f (x, α),and hence allow for searching over the class.Vapnik’s structural risk minimization (SRM) approach fits a nested sequence of models of increasing VC dimensions h1 < h2 < · · · , and thenchooses the model with the smallest value of the upper bound.We note that upper bounds like the ones in (7.46) are often very loose,but that doesn’t rule them out as good criteria for model selection, wherethe relative (not absolute) size of the test error is important.

The maindrawback of this approach is the difficulty in calculating the VC dimensionof a class of functions. Often only a crude upper bound for VC dimensionis obtainable, and this may not be adequate. An example in which thestructural risk minimization program can be successfully carried out is thesupport vector classifier, discussed in Section 12.2.7.9.1 Example (Continued)Figure 7.7 shows the results when AIC, BIC and SRM are used to selectthe model size for the examples of Figure 7.3.

For the examples labeled KNN,the model index α refers to neighborhood size, while for those labeled REG αrefers to subset size. Using each selection method (e.g., AIC) we estimatedthe best model α̂ and found its true prediction error ErrT (α̂) on a testset. For the same training set we computed the prediction error of the best2407. Model Assessment and Selection806040200% Increase Over Best100AICreg/KNNreg/linearclass/KNNclass/linearclass/KNNclass/linearclass/KNNclass/linear806040200% Increase Over Best100BICreg/KNNreg/linear806040200% Increase Over Best100SRMreg/KNNreg/linearFIGURE 7.7. Boxplots show the distribution of the relative error100 × [ErrT (α̂) − minα ErrT (α)]/[maxα ErrT (α) − minα ErrT (α)] over the fourscenarios of Figure 7.3. This is the error in using the chosen model relative tothe best model.

There are 100 training sets each of size 80 represented in eachboxplot, with the errors computed on test sets of size 10, 000.7.10 Cross-Validation241and worst possible model choices: minα ErrT (α) and maxα ErrT (α). Theboxplots show the distribution of the quantity100 ×ErrT (α̂) − minα ErrT (α),maxα ErrT (α) − minα ErrT (α)which represents the error in using the chosen model relative to the bestmodel. For linear regression the model complexity was measured by thenumber of features; as mentioned in Section 7.5, this underestimates thedf, since it does not charge for the search for the best model of that size.This was also used for the VC dimension of the linear classifier.

For knearest neighbors, we used the quantity N/k. Under an additive-error regression model, this can be justified as the exact effective degrees of freedom (Exercise 7.6); we do not know if it corresponds to the VC dimension. We used a1 = a2 = 1 for the constants in (7.46); the results for SRMchanged with different constants, and this choice gave the most favorable results. We repeated the SRM selection using the alternative practical bound(7.47), and got almost identical results. For misclassification error we usedσˆε 2 = [N/(N − d)] · err(α) for the least restrictive model (k = 5 for KNN,since k = 1 results in zero training error).

The AIC criterion seems to workwell in all four scenarios, despite the lack of theoretical support with 0–1loss. BIC does nearly as well, while the performance of SRM is mixed.7.10 Cross-ValidationProbably the simplest and most widely used method for estimating prediction error is cross-validation. This method directly estimates the expectedextra-sample error Err = E[L(Y, fˆ(X))], the average generalization errorwhen the method fˆ(X) is applied to an independent test sample from thejoint distribution of X and Y .

As mentioned earlier, we might hope thatcross-validation estimates the conditional error, with the training set Theld fixed. But as we will see in Section 7.12, cross-validation typicallyestimates well only the expected prediction error.7.10.1 K-Fold Cross-ValidationIdeally, if we had enough data, we would set aside a validation set and useit to assess the performance of our prediction model. Since data are oftenscarce, this is usually not possible.

To finesse the problem, K-fold crossvalidation uses part of the available data to fit the model, and a differentpart to test it. We split the data into K roughly equal-sized parts; forexample, when K = 5, the scenario looks like this:2427. Model Assessment and Selection12345TrainTrainValidationTrainTrainFor the kth part (third above), we fit the model to the other K − 1 partsof the data, and calculate the prediction error of the fitted model whenpredicting the kth part of the data.

We do this for k = 1, 2, . . . , K andcombine the K estimates of prediction error.Here are more details. Let κ : {1, . . . , N } 7→ {1, . . . , K} be an indexingfunction that indicates the partition to which observation i is allocated bythe randomization. Denote by fˆ−k (x) the fitted function, computed withthe kth part of the data removed. Then the cross-validation estimate ofprediction error isCV(fˆ) =N1 XL(yi , fˆ−κ(i) (xi )).N i=1(7.48)Typical choices of K are 5 or 10 (see below). The case K = N is knownas leave-one-out cross-validation. In this case κ(i) = i, and for the ithobservation the fit is computed using all the data except the ith.Given a set of models f (x, α) indexed by a tuning parameter α, denoteby fˆ−k (x, α) the αth model fit with the kth part of the data removed.

Thenfor this set of models we defineCV(fˆ, α) =N1 XL(yi , fˆ−κ(i) (xi , α)).N i=1(7.49)The function CV(fˆ, α) provides an estimate of the test error curve, and wefind the tuning parameter α̂ that minimizes it. Our final chosen model isf (x, α̂), which we then fit to all the data.It is interesting to wonder about what quantity K-fold cross-validationestimates. With K = 5 or 10, we might guess that it estimates the expected error Err, since the training sets in each fold are quite differentfrom the original training set. On the other hand, if K = N we mightguess that cross-validation estimates the conditional error ErrT .

It turnsout that cross-validation only estimates effectively the average error Err,as discussed in Section 7.12.What value should we choose for K? With K = N , the cross-validationestimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N “training sets” are so similarto one another. The computational burden is also considerable, requiringN applications of the learning method.

In certain special problems, thiscomputation can be done quickly—see Exercises 7.3 and 5.13.2430.40.00.21-Err0.60.87.10 Cross-Validation050100150Size of Training Set200FIGURE 7.8. Hypothetical learning curve for a classifier on a given task: aplot of 1 − Err versus the size of the training set N . With a dataset of 200observations, 5-fold cross-validation would use training sets of size 160, whichwould behave much like the full set. However, with a dataset of 50 observationsfivefold cross-validation would use training sets of size 40, and this would resultin a considerable overestimate of prediction error.On the other hand, with K = 5 say, cross-validation has lower variance.But bias could be a problem, depending on how the performance of thelearning method varies with the size of the training set.

Figure 7.8 showsa hypothetical “learning curve” for a classifier on a given task, a plot of1 − Err versus the size of the training set N . The performance of theclassifier improves as the training set size increases to 100 observations;increasing the number further to 200 brings only a small benefit. If ourtraining set had 200 observations, fivefold cross-validation would estimatethe performance of our classifier over training sets of size 160, which fromFigure 7.8 is virtually the same as the performance for training set size200. Thus cross-validation would not suffer from much bias. However if thetraining set had 50 observations, fivefold cross-validation would estimatethe performance of our classifier over training sets of size 40, and from thefigure that would be an underestimate of 1 − Err.

Hence as an estimate ofErr, cross-validation would be biased upward.To summarize, if the learning curve has a considerable slope at the giventraining set size, five- or tenfold cross-validation will overestimate the trueprediction error. Whether this bias is a drawback in practice depends onthe objective. On the other hand, leave-one-out cross-validation has lowbias but can have high variance. Overall, five- or tenfold cross-validationare recommended as a good compromise: see Breiman and Spector (1992)and Kohavi (1995).Figure 7.9 shows the prediction error and tenfold cross-validation curveestimated from a single training set, from the scenario in the bottom rightpanel of Figure 7.3.

This is a two-class classification problem, using a lin-7. Model Assessment and Selection•0.20.30.4•••• ••••• • ••••••• • • •• • • • • •• •• •• ••• •0.00.1Misclassification Error0.50.62445101520Subset Size pFIGURE 7.9. Prediction error (orange) and tenfold cross-validation curve(blue) estimated from a single training set, from the scenario in the bottom rightpanel of Figure 7.3.ear model with best subsets regression of subset size p. Standard error barsare shown, which are the standard errors of the individual misclassificationerror rates for each of the ten parts. Both curves have minima at p = 10,although the CV curve is rather flat beyond 10.

Характеристики

Тип файла

PDF-файл

Размер

12,69 Mb

Материал

The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction.pdf

Тип материала

Книга

Предмет

(ППП СОиАД) (SAS) Пакеты прикладных программ для статистической обработки и анализа данных

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

the-elements-of-statistical-learning.-data-mining_-inference_-and-prediction.pdf.rar

The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction.pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.