The Elements of Statistical Learning: Data Mining, Inference, and Prediction (811377), page 63


We will see that Gibbs sampling, an MCMC procedure, is closely related to the EM algorithm: the main difference is that it samples from the conditional distributions rather than maximizing over them.

Consider first the following abstract problem. We have random variables $U_1, U_2, \ldots, U_K$ and we wish to draw a sample from their joint distribution. Suppose this is difficult to do, but it is easy to simulate from the conditional distributions $\Pr(U_j \mid U_1, U_2, \ldots, U_{j-1}, U_{j+1}, \ldots, U_K)$, $j = 1, 2, \ldots, K$. The Gibbs sampling procedure alternately simulates from each of these distributions and, when the process stabilizes, provides a sample from the desired joint distribution. The procedure is defined in Algorithm 8.3.

Under regularity conditions it can be shown that this procedure eventually stabilizes, and the resulting random variables are indeed a sample from the joint distribution of $U_1, U_2, \ldots, U_K$. This occurs despite the fact that the samples $(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)})$ are clearly not independent for different $t$. More formally, Gibbs sampling produces a Markov chain whose stationary distribution is the true joint distribution, and hence the term “Markov chain Monte Carlo.” It is not surprising that the true joint distribution is stationary under this process, as the successive steps leave the marginal distributions of the $U_k$’s unchanged.

8. Model Inference and Averaging

Note that we don’t need to know the explicit form of the conditional densities, but just need to be able to sample from them. After the procedure reaches stationarity, the marginal density of any subset of the variables can be approximated by a density estimate applied to the sample values. However, if the explicit form of the conditional density $\Pr(U_k \mid U_\ell, \ell \neq k)$ is available, a better estimate of, say, the marginal density of $U_k$ can be obtained from (Exercise 8.3):

$$\widehat{\Pr}_{U_k}(u) = \frac{1}{M - m + 1} \sum_{t=m}^{M} \Pr(u \mid U_\ell^{(t)}, \ell \neq k). \tag{8.50}$$

Here we have averaged over the last $M - m + 1$ members of the sequence, to allow for an initial “burn-in” period before stationarity is reached.

Now getting back to Bayesian inference, our goal is to draw a sample from the joint posterior of the parameters given the data $\mathbf{Z}$.
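The abstract Gibbs procedure and the density estimate (8.50) described above can be illustrated with a small sketch. The target here is an assumption chosen for convenience and does not come from the text: a bivariate normal with unit variances and correlation `rho`, whose conditionals are $U_1 \mid U_2 \sim N(\rho U_2, 1 - \rho^2)$ and symmetrically for $U_2$. The function names and the `burn_in` argument (the number of initial draws discarded, i.e. $m - 1$) are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Alternately sample U1 | U2 and U2 | U1 for an assumed bivariate normal target."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    u1, u2 = 0.0, 0.0                     # arbitrary starting values
    chain = []
    for _ in range(n_iter):
        u1 = rng.gauss(rho * u2, cond_sd)  # draw from Pr(U1 | U2)
        u2 = rng.gauss(rho * u1, cond_sd)  # draw from Pr(U2 | U1)
        chain.append((u1, u2))
    return chain


def marginal_density_u1(u, chain, rho, burn_in):
    """Estimate the marginal density of U1 at u by averaging the known
    conditional density Pr(u | U2^(t)) over the post-burn-in draws, as in (8.50)."""
    var = 1.0 - rho * rho
    kept = chain[burn_in:]
    total = 0.0
    for _, u2 in kept:
        total += math.exp(-(u - rho * u2) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return total / len(kept)
```

For this assumed target the true marginal of $U_1$ is standard normal, so `marginal_density_u1(0.0, chain, rho, burn_in)` can be compared against $1/\sqrt{2\pi}$ as a sanity check.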

Gibbs sampling will be helpful if it is easy to sample from the conditional distribution of each parameter given the other parameters and $\mathbf{Z}$. An example, the Gaussian mixture problem, is detailed next.

There is a close connection between Gibbs sampling from a posterior and the EM algorithm in exponential family models. The key is to consider the latent data $\mathbf{Z}^m$ from the EM procedure to be another parameter for the Gibbs sampler. To make this explicit for the Gaussian mixture problem, we take our parameters to be $(\theta, \mathbf{Z}^m)$.

For simplicity we fix the variances $\sigma_1^2, \sigma_2^2$ and mixing proportion $\pi$ at their maximum likelihood values so that the only unknown parameters in $\theta$ are the means $\mu_1$ and $\mu_2$. The Gibbs sampler for the mixture problem is given in Algorithm 8.4. We see that steps 2(a) and 2(b) are the same as the E and M steps of the EM procedure, except that we sample rather than maximize. In step 2(a), rather than compute the maximum likelihood responsibilities $\gamma_i = E(\Delta_i \mid \theta, \mathbf{Z})$, the Gibbs sampling procedure simulates the latent data $\Delta_i$ from the distributions $\Pr(\Delta_i \mid \theta, \mathbf{Z})$. In step 2(b), rather than compute the maximizers of the posterior $\Pr(\mu_1, \mu_2, \Delta \mid \mathbf{Z})$ we simulate from the conditional distribution $\Pr(\mu_1, \mu_2 \mid \Delta, \mathbf{Z})$.

Figure 8.8 shows 200 iterations of Gibbs sampling, with the mean parameters $\mu_1$ (lower) and $\mu_2$ (upper) shown in the left panel, and the proportion of class 2 observations $\sum_i \Delta_i / N$ on the right.

Horizontal broken lines have been drawn at the maximum likelihood estimate values $\hat\mu_1$, $\hat\mu_2$ and $\sum_i \hat\gamma_i / N$ in each case. The values seem to stabilize quite quickly, and are distributed evenly around the maximum likelihood values.

The above mixture model was simplified, in order to make clear the connection between Gibbs sampling and the EM algorithm. More realistically, one would put a prior distribution on the variances $\sigma_1^2, \sigma_2^2$ and mixing proportion $\pi$, and include separate Gibbs sampling steps in which we sample from their posterior distributions, conditional on the other parameters. One can also incorporate proper (informative) priors for the mean parameters. These priors must not be improper as this will lead to a degenerate posterior, with all the mixing weight on one component.

8.6 MCMC for Sampling from the Posterior

Algorithm 8.4 Gibbs sampling for mixtures.

1. Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$.
2. Repeat for $t = 1, 2, \ldots$
   (a) For $i = 1, 2, \ldots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with $\Pr(\Delta_i^{(t)} = 1) = \hat\gamma_i(\theta^{(t)})$, from equation (8.42).
   (b) Set
       $$\hat\mu_1 = \frac{\sum_{i=1}^{N} (1 - \Delta_i^{(t)}) \cdot y_i}{\sum_{i=1}^{N} (1 - \Delta_i^{(t)})}, \qquad \hat\mu_2 = \frac{\sum_{i=1}^{N} \Delta_i^{(t)} \cdot y_i}{\sum_{i=1}^{N} \Delta_i^{(t)}},$$
       and generate $\mu_1^{(t)} \sim N(\hat\mu_1, \hat\sigma_1^2)$ and $\mu_2^{(t)} \sim N(\hat\mu_2, \hat\sigma_2^2)$.
3. Continue step 2 until the joint distribution of $(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ doesn’t change.

[Figure 8.8: two panels, “Mean Parameters” (left) and “Mixing Proportion” (right), each plotted against “Gibbs Iteration”.]

FIGURE 8.8. Mixture example. (Left panel:) 200 values of the two mean parameters from Gibbs sampling; horizontal lines are drawn at the maximum likelihood estimates $\hat\mu_1$, $\hat\mu_2$. (Right panel:) Proportion of values with $\Delta_i = 1$, for each of the 200 Gibbs sampling iterations; a horizontal line is drawn at $\sum_i \hat\gamma_i / N$.

Gibbs sampling is just one of a number of recently developed procedures for sampling from posterior distributions. It uses conditional sampling of each parameter given the rest, and is useful when the structure of the problem makes this sampling easy to carry out. Other methods do not require such structure, for example the Metropolis–Hastings algorithm.
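As a rough illustration of Algorithm 8.4, the following sketch runs the two sampling steps on synthetic data. Several things here are assumptions rather than part of the text: the data come from a hypothetical helper `make_data` (component means 1 and 8, unit variances, mixing proportion 0.5), the variances and mixing proportion are simply passed in as fixed known values instead of being set to maximum likelihood estimates, and an empty component keeps its previous mean so that the division in step 2(b) is always defined.

```python
import math
import random


def normal_pdf(y, mu, var):
    """Density of N(mu, var) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def make_data(n, seed=1):
    """Hypothetical synthetic two-component mixture (not the data used in the text)."""
    rng = random.Random(seed)
    return [rng.gauss(8.0, 1.0) if rng.random() < 0.5 else rng.gauss(1.0, 1.0)
            for _ in range(n)]


def gibbs_mixture(y, var1, var2, pi, mu_init, n_iter, seed=0):
    """Gibbs sampler for the two component means, with variances and pi held fixed."""
    rng = random.Random(seed)
    mu1, mu2 = mu_init
    draws = []
    for _ in range(n_iter):
        # Step 2(a): simulate each latent indicator from its responsibility.
        delta = []
        for yi in y:
            p1 = (1.0 - pi) * normal_pdf(yi, mu1, var1)
            p2 = pi * normal_pdf(yi, mu2, var2)
            gamma = p2 / (p1 + p2)
            delta.append(1 if rng.random() < gamma else 0)
        # Step 2(b): component averages, then draw new means around them.
        n2 = sum(delta)
        n1 = len(y) - n2
        # Assumption: an empty component keeps its previous mean.
        mu1_hat = sum(yi for yi, d in zip(y, delta) if d == 0) / n1 if n1 > 0 else mu1
        mu2_hat = sum(yi for yi, d in zip(y, delta) if d == 1) / n2 if n2 > 0 else mu2
        # Variances follow the algorithm as printed: N(mu_hat, sigma^2).
        mu1 = rng.gauss(mu1_hat, math.sqrt(var1))
        mu2 = rng.gauss(mu2_hat, math.sqrt(var2))
        draws.append((mu1, mu2, n2 / len(y)))
    return draws
```

Each element of `draws` holds the two sampled means and the proportion of observations assigned to component 2 at that iteration, which are the quantities plotted in the two panels of Figure 8.8.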

These and other computational Bayesian methods have been applied to sophisticated learning algorithms such as Gaussian process models and neural networks. Details may be found in the references given in the Bibliographic Notes at the end of this chapter.

8.7 Bagging

Earlier we introduced the bootstrap as a way of assessing the accuracy of a parameter estimate or a prediction. Here we show how to use the bootstrap to improve the estimate or prediction itself.

In Section 8.4 we investigated the relationship between the bootstrap and Bayes approaches, and found that the bootstrap mean is approximately a posterior average. Bagging further exploits this connection.

Consider first the regression problem. Suppose we fit a model to our training data $\mathbf{Z} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, obtaining the prediction $\hat f(x)$ at input $x$. Bootstrap aggregation or bagging averages this prediction over a collection of bootstrap samples, thereby reducing its variance. For each bootstrap sample $\mathbf{Z}^{*b}$, $b = 1, 2, \ldots, B$, we fit our model, giving prediction $\hat f^{*b}(x)$. The bagging estimate is defined by

$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x). \tag{8.51}$$

Denote by $\hat{\mathcal{P}}$ the empirical distribution putting equal probability $1/N$ on each of the data points $(x_i, y_i)$. In fact the “true” bagging estimate is defined by $E_{\hat{\mathcal{P}}} \hat f^*(x)$, where $\mathbf{Z}^* = \{(x_1^*, y_1^*), (x_2^*, y_2^*), \ldots, (x_N^*, y_N^*)\}$ and each $(x_i^*, y_i^*) \sim \hat{\mathcal{P}}$. Expression (8.51) is a Monte Carlo estimate of the true bagging estimate, approaching it as $B \to \infty$.

The bagged estimate (8.51) will differ from the original estimate $\hat f(x)$ only when the latter is a nonlinear or adaptive function of the data. For example, to bag the B-spline smooth of Section 8.2.1, we average the curves in the bottom left panel of Figure 8.2 at each value of $x$. The B-spline smoother is linear in the data if we fix the inputs; hence if we sample using the parametric bootstrap in equation (8.6), then $\hat f_{\mathrm{bag}}(x) \to \hat f(x)$ as $B \to \infty$ (Exercise 8.4). Hence bagging just reproduces the original smooth in the top left panel of Figure 8.2.
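The Monte Carlo estimate (8.51) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the nonparametric bootstrap, and the two base learners, `fit_mean` (the sample mean of the responses, a linear function of the data) and `fit_stump` (a one-split least-squares regression stump, an adaptive fit), are hypothetical stand-ins rather than the B-spline smoother or the trees discussed in the text.

```python
import random


def fit_mean(xs, ys):
    """Linear-in-the-data estimator: predicts the sample mean of y everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_stump(xs, ys):
    """Adaptive estimator: best single split on x by squared error, mean on each side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, 0.5 * (pairs[i - 1][0] + pairs[i][0]), ml, mr)
    if best is None:  # all x values equal: fall back to the overall mean
        return fit_mean(xs, ys)
    _, cut, ml, mr = best
    return lambda x: ml if x <= cut else mr


def bagged_predict(xs, ys, x0, fit, B=200, seed=0):
    """Average the prediction at x0 over B nonparametric bootstrap fits, as in (8.51)."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        model = fit([xs[i] for i in idx], [ys[i] for i in idx])
        total += model(x0)
    return total / B
```

With `fit_mean`, the bagged prediction should stay close to the original sample mean, in line with the remark that bagging changes little for estimators that are linear in the data; with `fit_stump`, the averaged prediction near the split is generally no longer equal to either side's mean.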

The same is approximately true if we were to bag using the nonparametric bootstrap.

A more interesting example is a regression tree, where $\hat f(x)$ denotes the tree’s prediction at input vector $x$ (regression trees are described in Chapter 9). Each bootstrap tree will typically involve different features than the original, and might have a different number of terminal nodes. The bagged estimate is the average prediction at $x$ from these $B$ trees.

Now suppose our tree produces a classifier $\hat G(x)$ for a $K$-class response. Here it is useful to consider an underlying indicator-vector function $\hat f(x)$, with value a single one and $K - 1$ zeroes, such that $\hat G(x) = \arg\max_k \hat f(x)$. Then the bagged estimate $\hat f_{\mathrm{bag}}(x)$ (8.51) is a $K$-vector $[p_1(x), p_2(x), \ldots, p_K(x)]$, with $p_k(x)$ equal to the proportion of trees predicting class $k$ at $x$. The bagged classifier selects the class with the most “votes” from the $B$ trees, $\hat G_{\mathrm{bag}}(x) = \arg\max_k \hat f_{\mathrm{bag}}(x)$.

Often we require the class-probability estimates at $x$, rather than the classifications themselves. It is tempting to treat the voting proportions $p_k(x)$ as estimates of these probabilities. A simple two-class example shows that they fail in this regard. Suppose the true probability of class 1 at $x$ is 0.75, and each of the bagged classifiers accurately predicts a 1. Then $p_1(x) = 1$, which is incorrect.
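The two-class example above can be checked numerically with a small sketch. The setup is an assumption made for illustration: each bagged "classifier" is reduced to a single terminal node containing the training labels observed at $x$, and that node is resampled with the nonparametric bootstrap. The function name is hypothetical. For contrast, the sketch also averages each classifier's node proportion, which is one natural alternative when such an underlying probability estimate is available.

```python
import random


def bag_votes_and_probs(labels, B=200, seed=0):
    """Return (proportion of class-1 votes, average class-1 node proportion)
    over B bootstrap resamples of the 0/1 labels observed at a point x."""
    rng = random.Random(seed)
    votes = 0
    prob_sum = 0.0
    for _ in range(B):
        boot = rng.choices(labels, k=len(labels))  # resample with replacement
        prop1 = sum(boot) / len(boot)              # this classifier's class-1 proportion
        votes += 1 if prop1 > 0.5 else 0           # majority vote of this classifier
        prob_sum += prop1
    return votes / B, prob_sum / B
```

Calling it on, say, `[1] * 30 + [0] * 10` (an assumed node with class-1 proportion 0.75) gives a vote proportion that sits at or very near 1, while the averaged node proportions stay close to 0.75.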

For many classifiers $\hat G(x)$, however, there is already an underlying function $\hat f(x)$ that estimates the class probabilities at $x$ (for trees, the class proportions in the terminal node). An alternative bagging strategy is to average these instead, rather than the vote indicator vectors. Not only does this produce improved estimates of the class probabilities, but it also tends to produce bagged classifiers with lower variance, especially for small $B$ (see Figure 8.10 in the next example).

8.7.1 Example: Trees with Simulated Data

We generated a sample of size $N = 30$, with two classes and $p = 5$ features, each having a standard Gaussian distribution with pairwise correlation 0.95.

The response $Y$ was generated according to $\Pr(Y = 1 \mid x_1 \le 0.5) = 0.2$, $\Pr(Y = 1 \mid x_1 > 0.5) = 0.8$. The Bayes error is 0.2. A test sample of size 2000 was also generated from the same population. We fit classification trees to the training sample and to each of 200 bootstrap samples (classification trees are described in Chapter 9). No pruning was used. Figure 8.9 shows the original tree and eleven bootstrap trees. Notice how the trees are all different, with different splitting features and cutpoints. The test error for the original tree and the bagged tree is shown in Figure 8.10.
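The data-generating process described above can be sketched as follows. The construction of the correlated features is an assumption: the text gives only the marginal distribution and the pairwise correlation, and a shared-factor representation, $x_j = \sqrt{\rho}\, z_0 + \sqrt{1 - \rho}\, z_j$ with independent standard normal $z$'s, is one simple way to obtain them. The function names are hypothetical, and the tree fitting itself is not reproduced here.

```python
import math
import random


def simulate(n, p=5, rho=0.95, seed=0):
    """Draw n observations: p standard Gaussian features with pairwise
    correlation rho, and a 0/1 response that depends only on x1."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # common factor inducing the correlation
        x = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(p)]
        prob1 = 0.2 if x[0] <= 0.5 else 0.8  # Pr(Y = 1 | x1)
        X.append(x)
        Y.append(1 if rng.random() < prob1 else 0)
    return X, Y


def bayes_rule_error(X, Y):
    """Misclassification rate of the Bayes rule: predict 1 exactly when x1 > 0.5."""
    wrong = sum(1 for x, y in zip(X, Y) if (1 if x[0] > 0.5 else 0) != y)
    return wrong / len(Y)
```

Evaluating `bayes_rule_error` on a large simulated test sample gives a rate that should fluctuate around the stated Bayes error of 0.2.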
