The Elements of Statistical Learning: Data Mining, Inference, and Prediction
However, on finite samples squared-error loss places much more emphasis on observations with large absolute residuals $|y_i - f(x_i)|$ during the fitting process. It is thus far less robust, and its performance severely degrades for long-tailed error distributions and especially for grossly mismeasured y-values (“outliers”).
Other more robust criteria, such as absolute loss, perform much better in these situations. In the statistical robustness literature, a variety of regression loss criteria have been proposed that provide strong resistance (if not absolute immunity) to gross outliers while being nearly as efficient as least squares for Gaussian errors. They are often better than either for error distributions with moderately heavy tails. One such criterion is the Huber loss criterion used for M-regression (Huber, 1964)

$$
L(y, f(x)) =
\begin{cases}
[y - f(x)]^2 & \text{for } |y - f(x)| \le \delta, \\
2\delta\,|y - f(x)| - \delta^2 & \text{otherwise}
\end{cases}
\qquad (10.23)
$$

Figure 10.5 compares these three loss functions.

These considerations suggest that when robustness is a concern, as is especially the case in data mining applications (see Section 10.7), squared-error loss for regression and exponential loss for classification are not the best criteria from a statistical perspective.
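For concreteness, the three regression losses can be sketched in a few lines of code (a minimal NumPy illustration; the function names and the default choice δ = 1 are illustrative, not part of the text):

```python
import numpy as np

def squared_loss(y, f):
    """Squared-error loss (y - f)^2."""
    return (y - f) ** 2

def absolute_loss(y, f):
    """Absolute-error loss |y - f|."""
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    """Huber loss (10.23): quadratic near zero, linear beyond delta."""
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2 * delta * r - delta ** 2)

# Example: residuals y - f ranging over [-3, 3], as in Figure 10.5.
resid = np.linspace(-3, 3, 7)
print(huber_loss(resid, 0.0))   # grows only linearly once |y - f| > delta
```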
[Figure 10.5: A comparison of three loss functions for regression, plotted as a function of the margin $y - f$. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when $|y - f|$ is large.]

Nevertheless, both squared-error loss and exponential loss lead to elegant modular boosting algorithms in the context of forward stagewise additive modeling. For squared-error loss one simply fits the base learner to the residuals from the current model, $y_i - f_{m-1}(x_i)$, at each step. For exponential loss one performs a weighted fit of the base learner to the output values $y_i$, with weights $w_i = \exp(-y_i f_{m-1}(x_i))$.
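To make the residual-fitting step concrete, here is a schematic sketch using a regression stump as the base learner (the stump, the scikit-learn implementation, and the synthetic data are illustrative choices, not the book's algorithm):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # illustrative base learner

def forward_stagewise_ls(X, y, M=100):
    """Schematic forward stagewise fit with squared-error loss:
    at each step the base learner is fit to the residuals y_i - f_{m-1}(x_i)."""
    f = np.full(len(y), y.mean())   # a simple starting model f_0
    learners = []
    for m in range(M):
        residuals = y - f                          # current residuals
        stump = DecisionTreeRegressor(max_depth=1) # any base learner would do
        stump.fit(X, residuals)
        f = f + stump.predict(X)                   # f_m = f_{m-1} + new term
        learners.append(stump)
    return learners, f

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
learners, fitted = forward_stagewise_ls(X, y, M=50)
```

The analogous step for exponential loss would instead fit the base learner with observation weights $w_i = \exp(-y_i f_{m-1}(x_i))$.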
Using other more robust criteria directly in their place does not give rise to such simple feasible boosting algorithms. However, in Section 10.10.2 we show how one can derive simple elegant boosting algorithms based on any differentiable loss criterion, thereby producing highly robust boosting procedures for data mining.

10.7 “Off-the-Shelf” Procedures for Data Mining

Predictive learning is an important aspect of data mining.
As can be seen from this book, a wide variety of methods have been developed for predictive learning from data. For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data. We have attempted to characterize appropriate situations in our discussions of each of the respective methods. However, it is seldom known in advance which procedure will perform best or even well for any given problem.
Table 10.1 summarizes some of the characteristics of a number of learning methods.

TABLE 10.1. Some characteristics of different learning methods. Key: ▲ = good, ◆ = fair, ▼ = poor.

Characteristic                                        Neural Nets   SVM   Trees   MARS   k-NN, Kernels
Natural handling of data of “mixed” type                   ▼          ▼      ▲       ▲         ▼
Handling of missing values                                 ▼          ▼      ▲       ▲         ▲
Robustness to outliers in input space                      ▼          ▼      ▲       ▼         ▲
Insensitive to monotone transformations of inputs          ▼          ▼      ▲       ▼         ▼
Computational scalability (large N)                        ▼          ▼      ▲       ▲         ▼
Ability to deal with irrelevant inputs                     ▼          ▼      ▲       ▲         ▼
Ability to extract linear combinations of features         ▲          ▲      ▼       ▼         ◆
Interpretability                                           ▼          ▼      ◆       ▲         ▼
Predictive power                                           ▲          ▲      ▼       ◆         ▲

Industrial and commercial data mining applications tend to be especially challenging in terms of the requirements placed on learning procedures. Data sets are often very large in terms of number of observations and number of variables measured on each of them. Thus, computational considerations play an important role. Also, the data are usually messy: the inputs tend to be mixtures of quantitative, binary, and categorical variables, the latter often with many levels.
There are generally many missing values, complete observations being rare. Distributions of numeric predictor and response variables are often long-tailed and highly skewed. This is the case for the spam data (Section 9.1.2); when fitting a generalized additive model, we first log-transformed each of the predictors in order to get a reasonable fit.
In addition they usually contain a substantial fraction of gross mis-measurements (outliers). The predictor variables are generally measured on very different scales.

In data mining applications, usually only a small fraction of the large number of predictor variables that have been included in the analysis are actually relevant to prediction. Also, unlike many applications such as pattern recognition, there is seldom reliable domain knowledge to help create especially relevant features and/or filter out the irrelevant ones, the inclusion of which dramatically degrades the performance of many methods.

In addition, data mining applications generally require interpretable models.
It is not enough to simply produce predictions. It is also desirable to have information providing qualitative understanding of the relationship between joint values of the input variables and the resulting predicted response value. Thus, black box methods such as neural networks, which can be quite useful in purely predictive settings such as pattern recognition, are far less useful for data mining.

These requirements of speed, interpretability and the messy nature of the data sharply limit the usefulness of most learning procedures as off-the-shelf methods for data mining.
An “off-the-shelf” method is one that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing or careful tuning of the learning procedure.

Of all the well-known learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining. They are relatively fast to construct and they produce interpretable models (if the trees are small). As discussed in Section 9.2, they naturally incorporate mixtures of numeric and categorical predictor variables and missing values.
They are invariant under (strictly monotone) transformations of the individual predictors. As a result, scaling and/or more general transformations are not an issue, and they are immune to the effects of predictor outliers. They perform internal feature selection as an integral part of the procedure.
They are thereby resistant, if not completely immune, to the inclusion of many irrelevant predictor variables. These properties of decision trees are largely the reason that they have emerged as the most popular learning method for data mining.

Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy. They seldom provide predictive accuracy comparable to the best that can be achieved with the data at hand. As seen in Section 10.1, boosting decision trees improves their accuracy, often dramatically. At the same time it maintains most of their desirable properties for data mining.
Some advantages of trees that are sacrificed by boosting are speed, interpretability, and, for AdaBoost, robustness against overlapping class distributions and especially mislabeling of the training data. A gradient boosted model (GBM) is a generalization of tree boosting that attempts to mitigate these problems, so as to produce an accurate and effective off-the-shelf procedure for data mining.

10.8 Example: Spam Data

Before we go into the details of gradient boosting, we demonstrate its abilities on a two-class classification problem.
The spam data are introduced in Chapter 1, and used as an example for many of the procedures in Chapter 9 (Sections 9.1.2, 9.2.5, 9.3.1 and 9.4.1).

Applying gradient boosting to these data resulted in a test error rate of 4.5%, using the same test set as was used in Section 9.1.2. By comparison, an additive logistic regression achieved 5.5%, a CART tree fully grown and pruned by cross-validation 8.7%, and MARS 5.5%.
The standard error of these estimates is around 0.6%, although gradient boosting is significantly better than all of them using the McNemar test (Exercise 10.6).

In Section 10.13 below we develop a relative importance measure for each predictor, as well as a partial dependence plot describing a predictor's contribution to the fitted model. We now illustrate these for the spam data. Figure 10.6 displays the relative importance spectrum for all 57 predictor variables.
Clearly some predictors are more important than others in separating spam from email. The frequencies of the character strings !, $, hp, and remove are estimated to be the four most relevant predictor variables. At the other end of the spectrum, the character strings 857, 415, table, and 3d have virtually no relevance.

The quantity being modeled here is the log-odds of spam versus email

$$
f(x) = \log\frac{\Pr(\text{spam}\mid x)}{\Pr(\text{email}\mid x)}
\qquad (10.24)
$$

(see Section 10.13 below).
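An analysis of this kind can be sketched with standard software; the snippet below uses scikit-learn's gradient boosting with placeholder data standing in for the 57 spam predictors (the variable names, settings, and synthetic data are assumptions for illustration and will not reproduce the figures in the text):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# Placeholder data standing in for the spam set: 57 predictors, binary label (1 = spam).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((500, 57))
y_train = rng.integers(0, 2, size=500)
feature_names = [f"x{j}" for j in range(57)]   # stand-ins for "!", "$", "hp", "remove", ...

# Gradient boosted model; max_leaf_nodes plays roughly the role of J terminal nodes per tree.
gbm = GradientBoostingClassifier(n_estimators=500, max_leaf_nodes=5)
gbm.fit(X_train, y_train)

# Relative importance spectrum (compare Figure 10.6).
order = np.argsort(gbm.feature_importances_)[::-1]
for j in order[:4]:
    print(feature_names[j], round(gbm.feature_importances_[j], 3))

# Partial dependence on one predictor (compare Figure 10.7) and on a pair (compare Figure 10.8).
top, second = int(order[0]), int(order[1])
pd_one = partial_dependence(gbm, X_train, features=[top])
pd_two = partial_dependence(gbm, X_train, features=[(top, second)])
```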
Figure 10.7 shows the partial dependence of the log-odds on selected important predictors, two positively associated with spam (! and remove), and two negatively associated (edu and hp). These particular dependencies are seen to be essentially monotonic. There is a general agreement with the corresponding functions found by the additive logistic regression model; see Figure 9.1 on page 303.

Running a gradient boosted model on these data with J = 2 terminal-node trees produces a purely additive (main effects) model for the log-odds, with a corresponding error rate of 4.7%, as compared to 4.5% for the full gradient boosted model (with J = 5 terminal-node trees).
Although not significant, this slightly higher error rate suggests that there may be interactions among some of the important predictor variables. This can be diagnosed through two-variable partial dependence plots. Figure 10.8 shows one of the several such plots displaying strong interaction effects. One sees that for very low frequencies of hp, the log-odds of spam are greatly increased. For high frequencies of hp, the log-odds of spam tend to be much lower and roughly constant as a function of !. As the frequency of hp decreases, the functional relationship with ! strengthens.

[Figure 10.6: Predictor variable importance spectrum for the spam data.]

10.9 Boosting Trees

Regression and classification trees are discussed in detail in Section 9.2. They partition the space of all joint predictor variable values into disjoint regions $R_j$, $j = 1, 2, \ldots, J$, as represented by the terminal nodes of the tree. A constant $\gamma_j$ is assigned to each such region and the predictive rule is

$$
x \in R_j \;\Rightarrow\; f(x) = \gamma_j .
$$
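Schematically, such a tree can be represented by its list of regions and constants (the Region container below is an illustrative device, not the book's notation):

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Region:
    contains: Callable[[Sequence[float]], bool]  # indicator for x in R_j
    gamma: float                                 # constant gamma_j for that region

def tree_predict(x: Sequence[float], regions: List[Region]) -> float:
    """Predictive rule of a tree: if x falls in R_j, predict gamma_j."""
    for r in regions:
        if r.contains(x):
            return r.gamma
    raise ValueError("the regions should partition the predictor space")

# A toy tree with J = 2 terminal nodes, splitting the first predictor at 0.5.
toy_tree = [
    Region(contains=lambda x: x[0] <= 0.5, gamma=-1.3),
    Region(contains=lambda x: x[0] >  0.5, gamma= 2.0),
]
print(tree_predict([0.2, 7.0], toy_tree))   # -1.3
print(tree_predict([0.9, 7.0], toy_tree))   #  2.0
```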