Of the 13 distinct features chosen by the tree, 11 overlap with the 16 significant features in the additive model (Table 9.2). The overall error rate shown in Table 9.3 is about 50% higher than for the additive model in Table 9.1.

FIGURE 9.4. Results for spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the “one-standard-error” rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α, shown above. The tree sizes shown below refer to |Tα|, the size of the original tree indexed by α.

Consider the rightmost branches of the tree. We branch to the right with a spam warning if more than 5.5% of the characters are the $ sign. However, if in addition the phrase hp occurs frequently, then this is likely to be company business and we classify as email. All of the 22 cases in the test set satisfying these criteria were correctly classified.
If the second condition is not met, and in addition the average length of repeated capital letters CAPAVE is larger than 2.9, then we classify as spam. Of the 227 test cases, only seven were misclassified.

In medical classification problems, the terms sensitivity and specificity are used to characterize a rule. They are defined as follows:

Sensitivity: probability of predicting disease given true state is disease.
Specificity: probability of predicting non-disease given true state is non-disease.

FIGURE 9.5. The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.
FIGURE 9.6. ROC curves for the classification rules fit to the spam data. Curves that are closer to the northeast corner represent better classifiers. In this case the GAM classifier dominates the trees. The weighted tree achieves better sensitivity for higher specificity than the unweighted tree. The numbers in the legend represent the area under the curve: Tree (0.95), GAM (0.98), Weighted Tree (0.90).

If we think of spam and email as the presence and absence of disease, respectively, then from Table 9.3 we have

Sensitivity = 100 × 33.4/(33.4 + 5.3) = 86.3%,
Specificity = 100 × 57.3/(57.3 + 4.0) = 93.4%.
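These percentages can be checked directly. The following minimal Python sketch (variable names are ours) reproduces the arithmetic from the rounded entries of Table 9.3:

```python
# Sensitivity and specificity from the rounded test-set percentages of
# Table 9.3 (spam plays the role of "disease").
spam_as_spam   = 33.4   # true spam, predicted spam
spam_as_email  = 5.3    # true spam, predicted email
email_as_email = 57.3   # true email, predicted email
email_as_spam  = 4.0    # true email, predicted spam

sensitivity = 100 * spam_as_spam / (spam_as_spam + spam_as_email)
specificity = 100 * email_as_email / (email_as_email + email_as_spam)
print(f"sensitivity = {sensitivity:.1f}%")  # 86.3%
print(f"specificity = {specificity:.1f}%")  # 93.5% here; the text's 93.4%
                                            # reflects unrounded counts
```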
In this analysis we have used equal losses. As before, let Lkk′ be the loss associated with predicting a class k object as class k′. By varying the relative sizes of the losses L01 and L10, we increase the sensitivity and decrease the specificity of the rule, or vice versa. In this example we want to avoid marking good email as spam, and thus we want the specificity to be very high. We can achieve this by setting L01 > 1, say, with L10 = 1. The Bayes rule in each terminal node then classifies to class 1 (spam) if the proportion of spam is ≥ L01/(L10 + L01), and to class zero otherwise.
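A minimal sketch of this loss-weighted rule in a terminal node (the function name and defaults are ours):

```python
def node_class(p_spam, L01=5.0, L10=1.0):
    """Classify a terminal node as spam (1) iff its spam proportion
    reaches the loss-adjusted threshold L01 / (L10 + L01)."""
    return 1 if p_spam >= L01 / (L10 + L01) else 0

# With L01 = 5, L10 = 1 the threshold is 5/6: a node that is 80% spam
# is still declared email, protecting good email from being flagged.
print(node_class(0.80))  # 0 (email)
print(node_class(0.90))  # 1 (spam)
```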
The receiver operating characteristic curve (ROC) is a commonly used summary for assessing the tradeoff between sensitivity and specificity. It is a plot of the sensitivity versus specificity as we vary the parameters of a classification rule.
Varying the loss L01 between 0.1 and 10, and applying the Bayes rule to the 17-node tree selected in Figure 9.4, produced the ROC curve shown in Figure 9.6. The standard error of each curve near 0.9 is approximately √(0.9(1 − 0.9)/1536) = 0.008, and hence the standard error of the difference is about 0.01. We see that in order to achieve a specificity of close to 100%, the sensitivity has to drop to about 50%. The area under the curve is a commonly used quantitative summary; extending the curve linearly in each direction so that it is defined over [0, 100], the area is approximately 0.95. For comparison, we have included the ROC curve for the GAM model fit to these data in Section 9.2; it gives a better classification rule for any loss, with an area of 0.98.
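The mechanics of sweeping a threshold and computing the area can be sketched as follows. The data here are synthetic stand-ins (the book's tree scores are not reproduced):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
# Stand-ins for the 1536 test cases: 0/1 labels and a score such as the
# spam proportion of the terminal node an observation falls into.
y = rng.integers(0, 2, size=1536)
score = np.clip(0.5 * y + rng.normal(0.25, 0.2, size=1536), 0.0, 1.0)

fpr, tpr, _ = roc_curve(y, score)  # sweep the decision threshold
print("area under ROC curve:", auc(fpr, tpr))
# sensitivity = tpr, specificity = 1 - fpr, as plotted in Figure 9.6.
# Binomial standard error of a point on the curve near 0.9:
print("SE:", np.sqrt(0.9 * 0.1 / 1536))  # about 0.008
```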
Rather than just modifying the Bayes rule in the nodes, it is better to take full account of the unequal losses in growing the tree, as was done in Section 9.2. With just two classes 0 and 1, losses may be incorporated into the tree-growing process by using weight Lk,1−k for an observation in class k. Here we chose L01 = 5, L10 = 1 and fit the same size tree as before (|Tα| = 17). This tree has higher sensitivity at high values of the specificity than the original tree, but does more poorly at the other extreme. Its top few splits are the same as the original tree, and then it departs from it. For this application the tree grown using L01 = 5 is clearly better than the original tree.
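In scikit-learn, for example, this kind of loss-weighted growing can be approximated with per-class observation weights; a sketch on synthetic stand-in data, not the book's fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                               # stand-in features
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)   # stand-in labels

# Weight L_{k,1-k} on each class-k observation: class 0 (email) gets
# weight L01 = 5, class 1 (spam) gets weight L10 = 1, so splits are
# chosen to avoid misclassifying good email as spam.
tree = DecisionTreeClassifier(class_weight={0: 5.0, 1: 1.0},
                              max_leaf_nodes=17,  # |Tα| = 17 as before
                              random_state=0).fit(X, y)
print(tree.get_n_leaves())
```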
The area under the ROC curve, used above, is sometimes called the c-statistic. Interestingly, it can be shown that the area under the ROC curve is equivalent to the Mann–Whitney U statistic (or Wilcoxon rank-sum test) for the median difference between the prediction scores in the two groups (Hanley and McNeil, 1982). For evaluating the contribution of an additional predictor when added to a standard model, the c-statistic may not be an informative measure. The new predictor can be very significant in terms of the change in model deviance, but show only a small increase in the c-statistic. For example, removal of the highly significant term george from the model of Table 9.2 results in a decrease in the c-statistic of less than 0.01. Instead, it is useful to examine how the additional predictor changes the classification on an individual sample basis. A good discussion of this point appears in Cook (2007).
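The Hanley–McNeil equivalence is easy to verify numerically. Here is a sketch with synthetic scores (all names are ours), building the U statistic from the rank sum of the class-1 scores:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)        # 0/1 group labels
score = y + rng.normal(size=500)        # prediction scores

# Wilcoxon rank sum of the class-1 scores, converted to Mann-Whitney U:
n1, n0 = (y == 1).sum(), (y == 0).sum()
rank_sum = rankdata(score)[y == 1].sum()
u = rank_sum - n1 * (n1 + 1) / 2
print(u / (n1 * n0))                    # normalized U statistic
print(roc_auc_score(y, score))          # the c-statistic; the two agree
```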
9.3 PRIM: Bump Hunting

Tree-based methods (for regression) partition the feature space into box-shaped regions, to try to make the response averages in each box as different as possible. The splitting rules defining the boxes are related to each other through a binary tree, facilitating their interpretation.

The patient rule induction method (PRIM) also finds boxes in the feature space, but seeks boxes in which the response average is high.
Hence it looks for maxima in the target function, an exercise known as bump hunting. (If minima rather than maxima are desired, one simply works with the negative response values.)

PRIM also differs from tree-based partitioning methods in that the box definitions are not described by a binary tree. This makes interpretation of the collection of rules more difficult; however, by removing the binary tree constraint, the individual rules are often simpler.

The main box construction method in PRIM works from the top down, starting with a box containing all of the data.
The box is compressed along one face by a small amount, and the observations then falling outside the box are peeled off. The face chosen for compression is the one resulting in the largest box mean, after the compression is performed. Then the process is repeated, stopping when the current box contains some minimum number of data points.

This process is illustrated in Figure 9.7. There are 200 data points uniformly distributed over the unit square. The color-coded plot indicates the response Y taking the value 1 (red) when 0.5 < X1 < 0.8 and 0.4 < X2 < 0.6, and zero (blue) otherwise. The panels show the successive boxes found by the top-down peeling procedure, peeling off a proportion α = 0.1 of the remaining data points at each stage.

Figure 9.8 shows the mean of the response values in the box, as the box is compressed.

After the top-down sequence is computed, PRIM reverses the process, expanding along any edge, if such an expansion increases the box mean. This is called pasting.
Since the top-down procedure is greedy at each step, such an expansion is often possible.

The result of these steps is a sequence of boxes, with different numbers of observations in each box. Cross-validation, combined with the judgment of the data analyst, is used to choose the optimal box size.

Denote by B1 the indices of the observations in the box found in step 1. The PRIM procedure then removes the observations in B1 from the training set, and the two-step process (top-down peeling, followed by bottom-up pasting) is repeated on the remaining dataset.
This entire process is repeated several times, producing a sequence of boxes B1, B2, ..., Bk. Each box is defined by a set of rules involving a subset of predictors like

(a1 ≤ X1 ≤ b1) and (b1 ≤ X3 ≤ b2).

A summary of the PRIM procedure is given in Algorithm 9.3.

PRIM can handle a categorical predictor by considering all partitions of the predictor, as in CART. Missing values are also handled in a manner similar to CART. PRIM is designed for regression (quantitative response

FIGURE 9.7. [Successive boxes found by the top-down peeling procedure on the 200 simulated data points; panels labeled 1–8, 12, 17, 22 and 27 of the peeling sequence.]
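A minimal sketch of the top-down peeling step on the simulated data of Figure 9.7 (function and parameter names are ours; pasting and the cross-validated choice of box size are omitted):

```python
import numpy as np

def peel(X, y, alpha=0.1, min_points=10):
    """Top-down peeling: shave a proportion alpha off whichever box face
    leaves the largest mean response, until few points remain."""
    box = np.array([X.min(axis=0), X.max(axis=0)])  # [lower; upper] bounds
    while True:
        inside = np.all((X >= box[0]) & (X <= box[1]), axis=1)
        if inside.sum() <= min_points:
            return box
        best_mean, best_box = -np.inf, None
        for j in range(X.shape[1]):
            xj = X[inside, j]
            for side, q in ((0, alpha), (1, 1 - alpha)):
                trial = box.copy()
                trial[side, j] = np.quantile(xj, q)  # shave one face
                keep = np.all((X >= trial[0]) & (X <= trial[1]), axis=1)
                if keep.any() and y[keep].mean() > best_mean:
                    best_mean, best_box = y[keep].mean(), trial
        box = best_box

# 200 points on the unit square, Y = 1 inside the target box, as in Figure 9.7.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = ((0.5 < X[:, 0]) & (X[:, 0] < 0.8) &
     (0.4 < X[:, 1]) & (X[:, 1] < 0.6)).astype(float)
print(peel(X, y))  # bounds should roughly recover [0.5, 0.8] x [0.4, 0.6]
```

The covering step would then drop the observations inside this box and rerun the same routine on what remains, yielding B2, B3, and so on.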