The Elements of Statistical Learning: Data Mining, Inference, and Prediction (811377), page 68


The full dataset sits at the top of the tree. Observations satisfying the condition at each junction are assigned to the left branch, and the others to the right branch. The terminal nodes or leaves of the tree correspond to the regions $R_1, R_2, \ldots, R_5$. The bottom right panel of Figure 9.2 is a perspective plot of the regression surface from this model. For illustration, we chose the node means $c_1 = -5$, $c_2 = -7$, $c_3 = 0$, $c_4 = 2$, $c_5 = 4$ to make this plot.

A key advantage of the recursive binary tree is its interpretability. The feature space partition is fully described by a single tree.

With more than two inputs, partitions like that in the top right panel of Figure 9.2 are difficult to draw, but the binary tree representation works in the same way. This representation is also popular among medical scientists, perhaps because it mimics the way that a doctor thinks. The tree stratifies the population into strata of high and low outcome, on the basis of patient characteristics.

9. Additive Models, Trees, and Related Methods · 9.2 Tree-Based Methods

FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.

9.2.2 Regression Trees

We now turn to the question of how to grow a regression tree.

Our data consists of $p$ inputs and a response, for each of $N$ observations: that is, $(x_i, y_i)$ for $i = 1, 2, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. The algorithm needs to automatically decide on the splitting variables and split points, and also what topology (shape) the tree should have. Suppose first that we have a partition into $M$ regions $R_1, R_2, \ldots, R_M$, and we model the response as a constant $c_m$ in each region:

$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m). \tag{9.10}$$

If we adopt as our criterion minimization of the sum of squares $\sum (y_i - f(x_i))^2$, it is easy to see that the best $\hat{c}_m$ is just the average of $y_i$ in region $R_m$:

$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m). \tag{9.11}$$

Now finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible.
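As a small numerical check of (9.11), the sketch below fits one constant per region by averaging and confirms that moving a constant away from the region mean increases the sum of squares. The function names and the toy data are illustrative assumptions, not part of the text.

```python
from statistics import mean

def fit_region_constants(regions, ys):
    """Return {region label: average of y in that region}, as in (9.11)."""
    grouped = {}
    for r, y in zip(regions, ys):
        grouped.setdefault(r, []).append(y)
    return {r: mean(vals) for r, vals in grouped.items()}

def sum_of_squares(regions, ys, constants):
    """Sum of (y_i - f(x_i))^2 for the piecewise-constant model (9.10)."""
    return sum((y - constants[r]) ** 2 for r, y in zip(regions, ys))

# Hypothetical toy data: region label and response for each observation.
regions = [1, 1, 1, 2, 2]
ys = [1.0, 2.0, 3.0, 10.0, 14.0]

c_hat = fit_region_constants(regions, ys)
best = sum_of_squares(regions, ys, c_hat)

# Shifting one constant away from the region mean can only hurt the fit.
perturbed = {1: c_hat[1] + 0.5, 2: c_hat[2]}
worse = sum_of_squares(regions, ys, perturbed)
```

Here the region means are 2.0 and 12.0, and the perturbed constants give a strictly larger sum of squares.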

Hence we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable $j$ and split point $s$, and define the pair of half-planes

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}. \tag{9.12}$$

Then we seek the splitting variable $j$ and split point $s$ that solve

$$\min_{j,\, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]. \tag{9.13}$$

For any choice $j$ and $s$, the inner minimization is solved by

$$\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)). \tag{9.14}$$

For each splitting variable, the determination of the split point $s$ can be done very quickly and hence by scanning through all of the inputs, determination of the best pair $(j, s)$ is feasible.

Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions.
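The search in (9.13) can be sketched as an exhaustive scan over variables and candidate split points. This is a minimal illustration, not an efficient implementation: it recomputes both sums of squares for every candidate, and it assumes that candidate split points are midpoints between consecutive distinct values of each input, which the text does not specify. The names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Return (j, s, cost) minimizing criterion (9.13).

    X is a list of rows (one list of inputs per observation) and
    y is the list of responses.
    """
    best = None
    for j in range(len(X[0])):
        # Assumed candidates: midpoints between consecutive distinct values.
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            # The inner minimizations are solved by the child means (9.14),
            # so the cost is the sum of within-child squared deviations.
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, s, cost)
    return best

# Hypothetical toy data: the response jumps when the first input exceeds 2.5.
X = [[1, 5], [2, 1], [3, 4], [4, 2]]
y = [1.0, 1.2, 5.0, 5.2]
j, s, cost = best_split(X, y)
```

On this toy data the scan selects the first variable (index 0) with split point 2.5.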

Then this process is repeated on all of the resulting regions.

How large should we grow the tree? Clearly a very large tree might overfit the data, while a small tree might not capture the important structure. Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold.

This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

The preferred strategy is to grow a large tree $T_0$, stopping the splitting process only when some minimum node size (say 5) is reached. Then this large tree is pruned using cost-complexity pruning, which we now describe.

We define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$, that is, collapsing any number of its internal (non-terminal) nodes.
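Growing the large tree $T_0$ can be sketched as a recursion that keeps splitting regardless of how much the sum of squares improves. The stopping rule below is one possible reading of "minimum node size": a split is admissible only if both children keep at least `min_size` observations. The nested-dict tree layout, the midpoint candidates, and the names `grow` and `count_leaves` are assumptions made for this illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean of a non-empty list."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def grow(X, y, min_size=5):
    """Recursively grow a regression tree as nested dicts.

    A leaf is {"mean": c, "n": N}; an internal node also has
    "j", "s", "left" and "right".
    """
    node = {"mean": sum(y) / len(y), "n": len(y)}
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [i for i, row in enumerate(X) if row[j] <= s]
            right = [i for i, row in enumerate(X) if row[j] > s]
            # Assumed rule: both children must keep at least min_size points.
            if len(left) < min_size or len(right) < min_size:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, s, left, right)
    if best is None:
        return node  # no admissible split, so this node is a leaf
    _, j, s, left, right = best
    node.update(
        j=j, s=s,
        left=grow([X[i] for i in left], [y[i] for i in left], min_size),
        right=grow([X[i] for i in right], [y[i] for i in right], min_size),
    )
    return node

def count_leaves(node):
    """Number of terminal nodes, |T| in the text's notation."""
    if "left" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Hypothetical data: one input, with a step in the response at x = 9.5.
X = [[float(i)] for i in range(20)]
y = [0.0] * 10 + [1.0] * 10
tree = grow(X, y, min_size=5)
```

Note that the two halves are split again even though the second-level splits do not reduce the sum of squares at all; deciding which of these splits to keep is left to the pruning step.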

We index terminal nodes by $m$, with node $m$ representing region $R_m$. Let $|T|$ denote the number of terminal nodes in $T$. Letting

$$N_m = \#\{x_i \in R_m\}, \qquad \hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i, \qquad Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2, \tag{9.15}$$

we define the cost complexity criterion

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|. \tag{9.16}$$

The idea is to find, for each $\alpha$, the subtree $T_\alpha \subseteq T_0$ to minimize $C_\alpha(T)$. The tuning parameter $\alpha \ge 0$ governs the tradeoff between tree size and its goodness of fit to the data. Large values of $\alpha$ result in smaller trees $T_\alpha$, and conversely for smaller values of $\alpha$. As the notation suggests, with $\alpha = 0$ the solution is the full tree $T_0$. We discuss how to adaptively choose $\alpha$ below.

For each $\alpha$ one can show that there is a unique smallest subtree $T_\alpha$ that minimizes $C_\alpha(T)$.
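To make the tradeoff in (9.16) concrete, the sketch below evaluates $C_\alpha(T)$ for three hand-built candidate subtrees and picks the minimizer for a given $\alpha$. It is not a pruning algorithm: the candidates, the toy responses, and the names `cost_complexity` and `best_subtree` are assumptions chosen for illustration.

```python
def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_m N_m Q_m(T) + alpha * |T|, as in (9.16).

    `leaves` is a list of lists, each holding the responses y_i that
    fall in one terminal node of the tree T.
    """
    total = 0.0
    for ys in leaves:
        n_m = len(ys)
        c_m = sum(ys) / n_m                            # node mean
        q_m = sum((y - c_m) ** 2 for y in ys) / n_m    # node impurity (9.15)
        total += n_m * q_m
    return total + alpha * len(leaves)

# Hypothetical nested subtrees, described only by their leaf contents.
full_tree = [[1.0, 1.2], [2.0, 2.2], [9.0, 9.4]]    # three leaves
pruned = [[1.0, 1.2, 2.0, 2.2], [9.0, 9.4]]         # first two leaves collapsed
root_only = [[1.0, 1.2, 2.0, 2.2, 9.0, 9.4]]        # single-node tree

def best_subtree(alpha):
    """Name of the candidate with the smallest cost-complexity value."""
    candidates = {"full": full_tree, "pruned": pruned, "root": root_only}
    return min(candidates, key=lambda name: cost_complexity(candidates[name], alpha))
```

With these numbers, `best_subtree(0)` returns `"full"`, `best_subtree(2)` returns `"pruned"`, and `best_subtree(100)` returns `"root"`, matching the statement that larger values of $\alpha$ favor smaller trees.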

To find $T_\alpha$ we use weakest link pruning: we successively collapse the internal node that produces the smallest per-node increase in $\sum_m N_m Q_m(T)$, and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain $T_\alpha$. See Breiman et al. (1984) or Ripley (1996) for details. Estimation of $\alpha$ is achieved by five- or tenfold cross-validation: we choose the value $\hat{\alpha}$ to minimize the cross-validated sum of squares. Our final tree is $T_{\hat{\alpha}}$.

9.2.3 Classification Trees

If the target is a classification outcome taking values $1, 2, \ldots, K$, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree. For regression we used the squared-error node impurity measure $Q_m(T)$ defined in (9.15), but this is not suitable for classification. In a node $m$, representing a region $R_m$ with $N_m$ observations, let

$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),$$

the proportion of class $k$ observations in node $m$.

FIGURE 9.3. Node impurity measures for two-class classification, as a function of the proportion $p$ in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5). [Curves shown: misclassification error, Gini index, entropy; horizontal axis: $p$.]

We classify the observations in node $m$ to class $k(m) = \arg\max_k \hat{p}_{mk}$, the majority class in node $m$. Different measures $Q_m(T)$ of node impurity include the following:

$$\begin{aligned}
\text{Misclassification error:} \quad & \frac{1}{N_m} \sum_{i \in R_m} I(y_i \ne k(m)) = 1 - \hat{p}_{mk(m)}. \\
\text{Gini index:} \quad & \sum_{k \ne k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}). \\
\text{Cross-entropy or deviance:} \quad & -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.
\end{aligned} \tag{9.17}$$

For two classes, if $p$ is the proportion in the second class, these three measures are $1 - \max(p, 1 - p)$, $2p(1 - p)$ and $-p \log p - (1 - p) \log(1 - p)$, respectively. They are shown in Figure 9.3. All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. Comparing (9.13) and (9.15), we see that we need to weight the node impurity measures by the number $N_{m_L}$ and $N_{m_R}$ of observations in the two child nodes created by splitting node $m$.

In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. For example, in a two-class problem with 400 observations in each class (denote this by (400, 400)), suppose one split created nodes (300, 100) and (100, 300), while the other created nodes (200, 400) and (200, 0). Both splits produce a misclassification rate of 0.25, but the second split produces a pure node and is probably preferable.

Both the Gini index and cross-entropy are lower for the second split. For this reason, either the Gini index or cross-entropy should be used when growing the tree. To guide cost-complexity pruning, any of the three measures can be used, but typically it is the misclassification rate.

The Gini index can be interpreted in two interesting ways. Rather than classify observations to the majority class in the node, we could classify them to class $k$ with probability $\hat{p}_{mk}$. Then the training error rate of this rule in the node is $\sum_{k \ne k'} \hat{p}_{mk} \hat{p}_{mk'}$, the Gini index. Similarly, if we code each observation as 1 for class $k$ and zero otherwise, the variance over the node of this 0-1 response is $\hat{p}_{mk} (1 - \hat{p}_{mk})$.
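The (400, 400) example can be checked numerically. The sketch below computes the three impurity measures of (9.17) from class counts and averages them over the two child nodes, weighting each child by its share of the observations. The normalization by the total count, the use of natural logarithms, and the function names are assumptions made for this illustration.

```python
from math import log

def proportions(counts):
    """Class proportions p_mk in a node, given per-class counts."""
    n = sum(counts)
    return [c / n for c in counts]

def misclassification(counts):
    return 1 - max(proportions(counts))

def gini(counts):
    return sum(p * (1 - p) for p in proportions(counts))

def cross_entropy(counts):
    # Terms with p = 0 contribute 0 (the limit of p log p as p -> 0).
    return -sum(p * log(p) for p in proportions(counts) if p > 0)

def weighted_impurity(children, measure):
    """Average child impurity, weighted by the child node sizes."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

# The two candidate splits of a (400, 400) node described in the text.
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]

for name, measure in [("misclassification", misclassification),
                      ("gini", gini),
                      ("cross-entropy", cross_entropy)]:
    a = weighted_impurity(split_a, measure)
    b = weighted_impurity(split_b, measure)
    print(f"{name}: split A = {a:.4f}, split B = {b:.4f}")
```

Under these assumptions both splits have a weighted misclassification rate of 0.25, while the Gini index and the cross-entropy are both lower for the second split, which contains the pure node.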
