The Elements of Statistical Learning: Data Mining, Inference, and Prediction (811377), page 65

File No. 811377: The Elements of Statistical Learning. Data Mining_ Inference_ and Prediction.pdf, page 65. 2020-08-25. StudIzba

Text from the file (page 65)

The problem is that we have not put each of the models on the same footing by taking into account their complexity (the number of inputs m in this example).

Stacked generalization, or stacking, is a way of doing this. Let $\hat f_m^{-i}(x)$ be the prediction at x, using model m, applied to the dataset with the ith training observation removed. The stacking estimate of the weights is obtained from the least squares linear regression of $y_i$ on $\hat f_m^{-i}(x_i)$, m = 1, 2, ..., M. In detail the stacking weights are given by

$$\hat w^{\mathrm{st}} = \operatorname*{argmin}_{w} \sum_{i=1}^{N} \Bigl[ y_i - \sum_{m=1}^{M} w_m \hat f_m^{-i}(x_i) \Bigr]^2. \tag{8.59}$$

The final prediction is $\sum_m \hat w_m^{\mathrm{st}} \hat f_m(x)$. By using the cross-validated predictions $\hat f_m^{-i}(x)$, stacking avoids giving unfairly high weight to models with higher complexity. Better results can be obtained by restricting the weights to be nonnegative, and to sum to 1. This seems like a reasonable restriction if we interpret the weights as posterior model probabilities as in equation (8.54), and it leads to a tractable quadratic programming problem.

There is a close connection between stacking and model selection via leave-one-out cross-validation (Section 7.10).

If we restrict the minimization in (8.59) to weight vectors w that have one unit weight and the rest zero, this leads to a model choice $\hat m$ with smallest leave-one-out cross-validation error. Rather than choose a single model, stacking combines them with estimated optimal weights. This will often lead to better prediction, but less interpretability than the choice of only one of the M models.

The stacking idea is actually more general than described above. One can use any learning method, not just linear regression, to combine the models as in (8.59); the weights could also depend on the input location x.
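As a rough illustration of (8.59), the following Python sketch computes leave-one-out predictions for three candidate models, regresses the responses on them to obtain stacking weights, and also evaluates the restriction to unit weight vectors, which picks the model with the smallest leave-one-out error. The simulated data, the choice of polynomial fits of degree 1 to 3 as the M models, and every function name are assumptions made for this example; none of them come from the text. The weights here are unconstrained; the nonnegative, sum-to-one variant would need a quadratic programming solver and is not shown.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares(columns, y):
    """Least squares coefficients of y on the given columns (no intercept)."""
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve_linear_system(gram, rhs)


def fit_polynomial(xs, ys, degree):
    """Return a prediction function for a least squares polynomial fit."""
    columns = [[x ** p for x in xs] for p in range(degree + 1)]
    coef = least_squares(columns, ys)
    return lambda x: sum(c * x ** p for p, c in enumerate(coef))


def loo_predictions(xs, ys, degree):
    """Predict each x_i from the fit with observation i removed."""
    preds = []
    for i in range(len(xs)):
        model = fit_polynomial(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], degree)
        preds.append(model(xs[i]))
    return preds


def rss(y, columns, w):
    """Residual sum of squares of y against the weighted sum of columns."""
    return sum(
        (yi - sum(wm * col[i] for wm, col in zip(w, columns))) ** 2
        for i, yi in enumerate(y)
    )


# Simulated data (an assumption for illustration only).
rng = random.Random(0)
xs = [-3.0 + 6.0 * i / 39 for i in range(40)]
ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
degrees = [1, 2, 3]

loo = [loo_predictions(xs, ys, d) for d in degrees]  # one column per model
w_stack = least_squares(loo, ys)                     # unconstrained (8.59)

# Restricting w to unit vectors reduces (8.59) to leave-one-out CV selection.
loo_rss = [rss(ys, [col], [1.0]) for col in loo]
m_hat = min(range(len(degrees)), key=lambda m: loo_rss[m])

# Final prediction: weighted sum of the models refitted on all the data.
full_fits = [fit_polynomial(xs, ys, d) for d in degrees]


def stacked_predict(x):
    return sum(w * f(x) for w, f in zip(w_stack, full_fits))
```

Because every unit vector is a feasible weight vector in the unconstrained problem, the stacked fit cannot have a larger leave-one-out residual sum of squares than the single model indexed by `m_hat`.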

In this way, learning methods are “stacked” on top of one another, to improve prediction performance.

8.9 Stochastic Search: Bumping

The final method described in this chapter does not involve averaging or combining models, but rather is a technique for finding a better single model. Bumping uses bootstrap sampling to move randomly through model space. For problems where the fitting method finds many local minima, bumping can help the method to avoid getting stuck in poor solutions.

[Figure 8.13: two scatterplot panels, “Regular 4-Node Tree” and “Bumped 4-Node Tree”]

FIGURE 8.13. Data with two features and two classes (blue and orange), displaying a pure interaction. The left panel shows the partition found by three splits of a standard, greedy, tree-growing algorithm. The vertical grey line near the left edge is the first split, and the broken lines are the two subsequent splits. The algorithm has no idea where to make a good initial split, and makes a poor choice. The right panel shows the near-optimal splits found by bumping the tree-growing algorithm 20 times.

As in bagging, we draw bootstrap samples and fit a model to each.

But rather than average the predictions, we choose the model estimated from a bootstrap sample that best fits the training data. In detail, we draw bootstrap samples $Z^{*1}, \ldots, Z^{*B}$ and fit our model to each, giving predictions $\hat f^{*b}(x)$, b = 1, 2, ..., B at input point x. We then choose the model that produces the smallest prediction error, averaged over the original training set. For squared error, for example, we choose the model obtained from bootstrap sample $\hat b$, where

$$\hat b = \operatorname*{arg\,min}_{b} \sum_{i=1}^{N} \bigl[ y_i - \hat f^{*b}(x_i) \bigr]^2. \tag{8.60}$$

The corresponding model predictions are $\hat f^{*\hat b}(x)$.
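The selection rule (8.60) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function `bump`, the straight-line fitter `fit_line`, the simulated data with two outliers, and the two loss functions are all hypothetical choices for illustration, not part of the text. The sketch keeps the original training sample as candidate 0, so the selected model can never have a higher training error than the original fit.

```python
import random


def fit_line(xs, ys):
    """Least squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # degenerate resample: flat line
    return lambda x: mean_y + slope * (x - mean_x)


def training_error(model, xs, ys, loss):
    """Total loss of a fitted model over the original training set."""
    return sum(loss(y, model(x)) for x, y in zip(xs, ys))


def bump(fit, xs, ys, n_boot, loss, rng):
    """Fit on bootstrap samples and keep the fit with the lowest training error.

    Returns (model, error, index); index 0 denotes the original sample.
    """
    n = len(xs)
    index_sets = [list(range(n))]
    index_sets += [[rng.randrange(n) for _ in range(n)] for _ in range(n_boot)]
    best_model, best_error, best_index = None, float("inf"), -1
    for b, idx in enumerate(index_sets):
        model = fit([xs[i] for i in idx], [ys[i] for i in idx])
        # Every candidate is scored on the ORIGINAL data, as in (8.60).
        error = training_error(model, xs, ys, loss)
        if error < best_error:
            best_model, best_error, best_index = model, error, b
    return best_model, best_error, best_index


def squared_loss(y, pred):
    return (y - pred) ** 2


def absolute_loss(y, pred):
    return abs(y - pred)


# Simulated data (an assumption for illustration only).
rng = random.Random(1)
xs = [i / 29 for i in range(30)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
ys[28] += 5.0  # two gross outliers distort the least squares line
ys[29] += 5.0

original = fit_line(xs, ys)
model_sq, err_sq, b_sq = bump(fit_line, xs, ys, 50, squared_loss, rng)
model_abs, err_abs, b_abs = bump(fit_line, xs, ys, 50, absolute_loss, rng)
```

With least squares fitting and squared error scoring, the original fit already minimizes the training error, so bumping has nothing to gain in that setting. The example therefore also scores by absolute error, where a bootstrap fit that happens to omit the outliers can win.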

By convention we also include the original training sample in the set of bootstrap samples, so that the method is free to pick the original model if it has the lowest training error.

By perturbing the data, bumping tries to move the fitting procedure around to good areas of model space. For example, if a few data points are causing the procedure to find a poor solution, any bootstrap sample that omits those data points should produce a better solution.

For another example, consider the classification data in Figure 8.13, the notorious exclusive or (XOR) problem.

There are two classes (blue and orange) and two input features, with the features exhibiting a pure interaction. By splitting the data at x1 = 0 and then splitting each resulting stratum at x2 = 0 (or vice versa), a tree-based classifier could achieve perfect discrimination. However, the greedy, short-sighted CART algorithm (Section 9.2) tries to find the best split on either feature, and then splits the resulting strata. Because of the balanced nature of the data, all initial splits on x1 or x2 appear to be useless, and the procedure essentially generates a random split at the top level. The actual split found for these data is shown in the left panel of Figure 8.13.
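A small numerical check makes the difficulty concrete. The sketch below builds a perfectly balanced XOR grid (an idealized stand-in for the data in Figure 8.13, not the actual data) and scores candidate first splits by the decrease in Gini impurity. The use of Gini as the split criterion, the grid, and the function names are assumptions for illustration. On this grid every first split on either feature has zero gain, while the two-level rule that splits at x1 = 0 and then at x2 = 0 classifies every point correctly.

```python
from itertools import product


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_gain(points, labels, feature, threshold):
    """Decrease in weighted Gini impurity from splitting feature < threshold."""
    left = [c for pt, c in zip(points, labels) if pt[feature] < threshold]
    right = [c for pt, c in zip(points, labels) if pt[feature] >= threshold]
    n = len(labels)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - children


# Balanced XOR grid: class 1 when x1 and x2 have the same sign.
levels = [-2.0, -1.0, 1.0, 2.0]
points = list(product(levels, levels))
labels = [1 if x1 * x2 > 0 else 0 for x1, x2 in points]

# Gain of every candidate first split on either feature.
thresholds = [-1.5, 0.0, 1.5]
first_split_gains = {
    (feature, t): split_gain(points, labels, feature, t)
    for feature in (0, 1)
    for t in thresholds
}

# Once the data are split at x1 = 0, a split at x2 = 0 becomes informative.
left_points = [pt for pt in points if pt[0] < 0]
left_labels = [c for pt, c in zip(points, labels) if pt[0] < 0]
second_split_gain = split_gain(left_points, left_labels, 1, 0.0)


def two_level_tree(pt):
    """Split at x1 = 0, then split each half at x2 = 0."""
    x1, x2 = pt
    if x1 < 0:
        return 1 if x2 < 0 else 0
    return 1 if x2 >= 0 else 0


accuracy = sum(two_level_tree(pt) == c for pt, c in zip(points, labels)) / len(points)
```

A greedy learner that looks only one split ahead sees no reason to prefer x1 = 0 over any other threshold, even though that split is what makes the second level useful.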

By bootstrap sampling from the data, bumping breaks the balance in the classes, and with a reasonable number of bootstrap samples (here 20), it will by chance produce at least one tree with initial split near either x1 = 0 or x2 = 0. Using just 20 bootstrap samples, bumping found the near-optimal splits shown in the right panel of Figure 8.13. This shortcoming of the greedy tree-growing algorithm is exacerbated if we add a number of noise features that are independent of the class label. Then the tree-growing algorithm cannot distinguish x1 or x2 from the others, and gets seriously lost.

Since bumping compares different models on the training data, one must ensure that the models have roughly the same complexity. In the case of trees, this would mean growing trees with the same number of terminal nodes on each bootstrap sample.

Bumping can also help in problems where it is difficult to optimize the fitting criterion, perhaps because of a lack of smoothness. The trick is to optimize a different, more convenient criterion over the bootstrap samples, and then choose the model producing the best results for the desired criterion on the training sample.

Bibliographic Notes

There are many books on classical statistical inference: Cox and Hinkley (1974) and Silvey (1975) give nontechnical accounts.

The bootstrap is due to Efron (1979) and is described more fully in Efron and Tibshirani (1993) and Hall (1992). A good modern book on Bayesian inference is Gelman et al. (1995). A lucid account of the application of Bayesian methods to neural networks is given in Neal (1996). The statistical application of Gibbs sampling is due to Geman and Geman (1984), and Gelfand and Smith (1990), with related work by Tanner and Wong (1987). Markov chain Monte Carlo methods, including Gibbs sampling and the Metropolis–Hastings algorithm, are discussed in Spiegelhalter et al. (1996). The EM algorithm is due to Dempster et al. (1977); as the discussants in that paper make clear, there was much related, earlier work. The view of EM as a joint maximization scheme for a penalized complete-data log-likelihood was elucidated by Neal and Hinton (1998); they credit Csiszar and Tusnády (1984) and Hathaway (1986) as having noticed this connection earlier. Bagging was proposed by Breiman (1996a). Stacking is due to Wolpert (1992); Breiman (1996b) contains an accessible discussion for statisticians.

Leblanc and Tibshirani (1996) describe variations on stacking based on the bootstrap. Model averaging in the Bayesian framework has been recently advocated by Madigan and Raftery (1994). Bumping was proposed by Tibshirani and Knight (1999).

Exercises

Ex. 8.1 Let r(y) and q(y) be probability density functions. Jensen’s inequality states that for a random variable X and a convex function φ(x), E[φ(X)] ≥ φ[E(X)]. Use Jensen’s inequality to show that

$$E_q \log[r(Y)/q(Y)] \tag{8.61}$$

is maximized as a function of r(y) when r(y) = q(y). Hence show that R(θ, θ) ≥ R(θ′, θ) as stated below equation (8.46).

Ex. 8.2 Consider the maximization of the log-likelihood (8.48), over distributions $\tilde P(Z^m)$ such that $\tilde P(Z^m) \ge 0$ and $\sum_{Z^m} \tilde P(Z^m) = 1$. Use Lagrange multipliers to show that the solution is the conditional distribution $\tilde P(Z^m) = \Pr(Z^m \mid Z, \theta')$, as in (8.49).

Ex. 8.3 Justify the estimate (8.50), using the relationship

$$\Pr(A) = \int \Pr(A \mid B)\, d(\Pr(B)).$$

Ex. 8.4 Consider the bagging method of Section 8.7. Let our estimate $\hat f(x)$ be the B-spline smoother $\hat\mu(x)$ of Section 8.2.1. Consider the parametric bootstrap of equation (8.6), applied to this estimator. Show that if we bag $\hat f(x)$, using the parametric bootstrap to generate the bootstrap samples, the bagging estimate $\hat f_{\mathrm{bag}}(x)$ converges to the original estimate $\hat f(x)$ as B → ∞.

Ex. 8.5 Suggest generalizations of each of the loss functions in Figure 10.4 to more than two classes, and design an appropriate plot to compare them.

Ex. 8.6 Consider the bone mineral density data of Figure 5.6.

(a) Fit a cubic smoothing spline to the relative change in spinal BMD, as a function of age. Use cross-validation to estimate the optimal amount of smoothing. Construct pointwise 90% confidence bands for the underlying function.

(b) Compute the posterior mean and covariance for the true function via (8.28), and compare the posterior bands to those obtained in (a).

(c) Compute 100 bootstrap replicates of the fitted curves, as in the bottom left panel of Figure 8.2. Compare the results to those obtained in (a) and (b).

Ex. 8.7 EM as a minorization algorithm (Hunter and Lange, 2004; Wu and Lange, 2007). A function g(x, y) is said to minorize a function f(x) if

$$g(x, y) \le f(x), \quad g(x, x) = f(x) \tag{8.62}$$

for all x, y in the domain. This is useful for maximizing f(x) since it is easy to show that f(x) is non-decreasing under the update

$$x^{s+1} = \operatorname*{argmax}_{x} g(x, x^{s}). \tag{8.63}$$

There are analogous definitions for majorization, for minimizing a function f(x). The resulting algorithms are known as MM algorithms, for “Minorize-Maximize” or “Majorize-Minimize.”

Show that the EM algorithm (Section 8.5.2) is an example of an MM algorithm, using Q(θ′, θ) + log Pr(Z|θ) − Q(θ, θ) to minorize the observed data log-likelihood ℓ(θ′; Z). (Note that only the first term involves the relevant parameter θ′.)

9 Additive Models, Trees, and Related Methods

In this chapter we begin our discussion of some specific methods for supervised learning. These techniques each assume a (different) structured form for the unknown regression function, and by doing so they finesse the curse of dimensionality. Of course, they pay the possible price of misspecifying the model, and so in each case there is a tradeoff that has to be made. They take off where Chapters 3–6 left off.
