The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Summing over classes k again gives the Gini index.

9.2.4 Other Issues

Categorical Predictors

When splitting a predictor having q possible unordered values, there are $2^{q-1} - 1$ possible partitions of the q values into two groups, and the computations become prohibitive for large q. However, with a 0-1 outcome, this computation simplifies. We order the predictor classes according to the proportion falling in outcome class 1.
Then we split this predictor as if it were an ordered predictor. One can show this gives the optimal split, in terms of cross-entropy or Gini index, among all possible $2^{q-1} - 1$ splits. This result also holds for a quantitative outcome and square error loss: the categories are ordered by increasing mean of the outcome. Although intuitive, the proofs of these assertions are not trivial. The proof for binary outcomes is given in Breiman et al. (1984) and Ripley (1996); the proof for quantitative outcomes can be found in Fisher (1958).
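As a concrete illustration of this ordering trick (not code from the book, with hypothetical function names), the following NumPy sketch assumes a categorical predictor x and a 0/1 outcome y, orders the categories by their class-1 proportion, and scans the q-1 ordered splits by weighted Gini impurity:

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity of a node with 0/1 labels y."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)                      # proportion in class 1
    return 2.0 * p * (1.0 - p)

def best_categorical_split(x, y):
    """Order the q categories by the proportion of class-1 outcomes, then
    scan the q-1 ordered splits, keeping the one with the lowest weighted
    Gini impurity."""
    cats = np.unique(x)
    props = np.array([y[x == c].mean() for c in cats])
    order = cats[np.argsort(props)]
    best = (np.inf, None)
    for i in range(1, len(order)):
        left = np.isin(x, order[:i])
        score = (left.sum() * gini_impurity(y[left]) +
                 (~left).sum() * gini_impurity(y[~left])) / len(y)
        if score < best[0]:
            best = (score, set(order[:i]))
    return best   # (weighted Gini, categories sent to the left child)
```

By the result cited above, the best of these q-1 ordered splits coincides with the best of all $2^{q-1}-1$ two-group partitions.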
For multicategory outcomes, no such simplifications are possible, although various approximations have been proposed (Loh and Vanichsetakul, 1988).

The partitioning algorithm tends to favor categorical predictors with many levels q; the number of partitions grows exponentially in q, and the more choices we have, the more likely we can find a good one for the data at hand. This can lead to severe overfitting if q is large, and such variables should be avoided.

The Loss Matrix

In classification problems, the consequences of misclassifying observations are more serious in some classes than others.
For example, it is probably worse to predict that a person will not have a heart attack when he/she actually will, than vice versa. To account for this, we define a $K \times K$ loss matrix $\mathbf{L}$, with $L_{kk'}$ being the loss incurred for classifying a class k observation as class k′. Typically no loss is incurred for correct classifications, that is, $L_{kk} = 0 \;\; \forall k$. To incorporate the losses into the modeling process, we could modify the Gini index to $\sum_{k \neq k'} L_{kk'} \hat{p}_{mk} \hat{p}_{mk'}$; this would be the expected loss incurred by the randomized rule. This works for the multiclass case, but in the two-class case has no effect, since the coefficient of $\hat{p}_{mk} \hat{p}_{mk'}$ is $L_{kk'} + L_{k'k}$. For two classes a better approach is to weight the observations in class k by $L_{kk'}$.
This weighting can be used in the multiclass case only if, as a function of k, $L_{kk'}$ does not depend on k′. Observation weighting can be used with the deviance as well. The effect of observation weighting is to alter the prior probability on the classes. In a terminal node, the empirical Bayes rule implies that we classify to class $k(m) = \arg\min_k \sum_{\ell} L_{\ell k} \, \hat{p}_{m\ell}$.
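The terminal-node rule above is simple enough to state directly in code. This is a minimal sketch (the function name and the example loss values are made up), taking the node's class proportions and the loss matrix as inputs:

```python
import numpy as np

def terminal_node_class(p_hat, L):
    """Empirical Bayes rule in a terminal node m: choose the class k that
    minimizes the expected loss sum_l L[l, k] * p_hat[l], where p_hat holds
    the node's class proportions and L is the K x K loss matrix."""
    expected_loss = L.T @ p_hat          # entry k is sum_l L[l, k] * p_hat[l]
    return int(np.argmin(expected_loss))

# Example: two classes, with missing a class-1 case five times as costly.
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])
p_hat = np.array([0.8, 0.2])             # node proportions for classes 0 and 1
print(terminal_node_class(p_hat, L))     # predicts class 1 despite p_hat[0] > p_hat[1]
```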
Missing Predictor Values

Suppose our data has some missing predictor values in some or all of the variables. We might discard any observation with some missing values, but this could lead to serious depletion of the training set. Alternatively we might try to fill in (impute) the missing values, with say the mean of that predictor over the nonmissing observations. For tree-based models, there are two better approaches. The first is applicable to categorical predictors: we simply make a new category for “missing.” From this we might discover that observations with missing values for some measurement behave differently than those with nonmissing values. The second, more general approach is the construction of surrogate variables. When considering a predictor for a split, we use only the observations for which that predictor is not missing. Having chosen the best (primary) predictor and split point, we form a list of surrogate predictors and split points. The first surrogate is the predictor and corresponding split point that best mimics the split of the training data achieved by the primary split.
The second surrogate is the predictor and corresponding split point that does second best, and so on. When sending observations down the tree, either in the training phase or during prediction, we use the surrogate splits in order if the primary splitting predictor is missing. Surrogate splits exploit correlations between predictors to try to alleviate the effect of missing data. The higher the correlation between the missing predictor and the other predictors, the smaller the loss of information due to the missing value. The general problem of missing data is discussed in Section 9.6.
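A rough sketch of how surrogates might be ranked, assuming missing values are coded as NaN in a float matrix X and that primary_goes_left records the primary split's left/right assignment; this illustrates the idea rather than reproducing the exact CART procedure:

```python
import numpy as np

def choose_surrogates(X, primary_goes_left, candidate_cols):
    """For each candidate predictor, find the split point whose left/right
    assignment best agrees with the primary split, using only rows where
    the candidate is observed; return candidates ranked by agreement."""
    surrogates = []
    for j in candidate_cols:
        xj = X[:, j]
        obs = ~np.isnan(xj)                      # rows where predictor j is observed
        best_agree, best_s = 0.0, None
        for s in np.unique(xj[obs]):
            agree = np.mean((xj[obs] <= s) == primary_goes_left[obs])
            agree = max(agree, 1.0 - agree)      # allow the mirrored split
            if agree > best_agree:
                best_agree, best_s = agree, s
        surrogates.append((best_agree, j, best_s))
    # The first entry is the surrogate that best mimics the primary split.
    return sorted(surrogates, reverse=True)
```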
Why Binary Splits?

Rather than splitting each node into just two groups at each stage (as above), we might consider multiway splits into more than two groups. While this can sometimes be useful, it is not a good general strategy. The problem is that multiway splits fragment the data too quickly, leaving insufficient data at the next level down. Hence we would want to use such splits only when needed. Since multiway splits can be achieved by a series of binary splits, the latter are preferred.

Other Tree-Building Procedures

The discussion above focuses on the CART (classification and regression tree) implementation of trees. The other popular methodology is ID3 and its later versions, C4.5 and C5.0 (Quinlan, 1993).
Early versions of the program were limited to categorical predictors, and used a top-down rule with no pruning. With more recent developments, C5.0 has become quite similar to CART. The most significant feature unique to C5.0 is a scheme for deriving rule sets. After a tree is grown, the splitting rules that define the terminal nodes can sometimes be simplified: that is, one or more conditions can be dropped without changing the subset of observations that fall in the node. We end up with a simplified set of rules defining each terminal node; these no longer follow a tree structure, but their simplicity might make them more attractive to the user.

Linear Combination Splits

Rather than restricting splits to be of the form $X_j \leq s$, one can allow splits along linear combinations of the form $\sum_j a_j X_j \leq s$. The weights $a_j$ and split point s are optimized to minimize the relevant criterion (such as the Gini index).
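CART uses its own search strategy for these weights; purely as an illustration of the idea, the sketch below tries random unit directions and scans a handful of candidate thresholds along each projection, keeping the linear split with the lowest weighted Gini impurity (the function names and tuning constants are made up, and this is not the CART algorithm):

```python
import numpy as np

def gini(y):
    """Gini impurity of a node for a 0/1 outcome."""
    p = np.mean(y) if len(y) else 0.0
    return 2.0 * p * (1.0 - p)

def random_linear_split(X, y, n_directions=200, seed=0):
    """Crude search for a split of the form sum_j a_j * X_j <= s: try random
    unit directions a, scan candidate thresholds s along each projection, and
    keep the (a, s) pair with the lowest weighted Gini impurity."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best = (np.inf, None, None)
    for _ in range(n_directions):
        a = rng.normal(size=p)
        a /= np.linalg.norm(a)
        z = X @ a
        for s in np.quantile(z, np.linspace(0.1, 0.9, 9)):
            left = z <= s
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[0]:
                best = (score, a, s)
    return best   # (weighted Gini, weights a, split point s)
```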
While this can improve the predictive power of the tree, it can hurt interpretability. Computationally, the discreteness of the split point search precludes the use of a smooth optimization for the weights. A better way to incorporate linear combination splits is in the hierarchical mixtures of experts (HME) model, the topic of Section 9.5.

Instability of Trees

One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it. One can alleviate this to some degree by trying to use a more stable split criterion, but the inherent instability is not removed.
It is the price to be paid for estimating a simple, tree-based structure from the data. Bagging (Section 8.7) averages many trees to reduce this variance.

Lack of Smoothness

Another limitation of trees is the lack of smoothness of the prediction surface, as can be seen in the bottom right panel of Figure 9.2.
In classification with 0/1 loss, this doesn't hurt much, since bias in estimation of the class probabilities has a limited effect. However, this can degrade performance in the regression setting, where we would normally expect the underlying function to be smooth. The MARS procedure, described in Section 9.4, can be viewed as a modification of CART designed to alleviate this lack of smoothness.

TABLE 9.3. Spam data: confusion rates for the 17-node tree (chosen by cross-validation) on the test data. Overall error rate is 9.3%.

                  Predicted
  True        email      spam
  email       57.3%      4.0%
  spam         5.3%     33.4%

Difficulty in Capturing Additive Structure

Another problem with trees is their difficulty in modeling additive structure. In regression, suppose, for example, that $Y = c_1 I(X_1 < t_1) + c_2 I(X_2 < t_2) + \varepsilon$, where $\varepsilon$ is zero-mean noise.
Then a binary tree might make its first split on $X_1$ near $t_1$. At the next level down it would have to split both nodes on $X_2$ at $t_2$ in order to capture the additive structure. This might happen with sufficient data, but the model is given no special encouragement to find such structure. If there were ten rather than two additive effects, it would take many fortuitous splits to recreate the structure, and the data analyst would be hard pressed to recognize it in the estimated tree. The “blame” here can again be attributed to the binary tree structure, which has both advantages and drawbacks.
Again the MARS method (Section 9.4) gives up this tree structure in order to capture additive structure.
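A short simulation of this additive setup makes the point concrete; the constants c1, c2, t1, t2, the noise level, and the sample size below are arbitrary choices for illustration, not values from the book:

```python
import numpy as np

# Simulate the additive target from the text: Y = c1*I(X1 < t1) + c2*I(X2 < t2) + eps.
rng = np.random.default_rng(0)
n, c1, c2, t1, t2 = 1000, 2.0, 1.0, 0.0, 0.5
X = rng.uniform(-1, 1, size=(n, 2))
y = (c1 * (X[:, 0] < t1) + c2 * (X[:, 1] < t2)
     + rng.normal(scale=0.3, size=n))
# A binary tree must first split on X1 near t1 and then split *both* children
# on X2 near t2 to recover this structure; an additive fit needs only one
# threshold per variable.
```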
9.2.5 Spam Example (Continued)

We applied the classification tree methodology to the spam example introduced earlier. We used the deviance measure to grow the tree and misclassification rate to prune it. Figure 9.4 shows the 10-fold cross-validation error rate as a function of the size of the pruned tree, along with ±2 standard errors of the mean, from the ten replications. The test error curve is shown in orange. Note that the cross-validation error rates are indexed by a sequence of values of α and not tree size; for trees grown in different folds, a value of α might imply different sizes. The sizes shown at the base of the plot refer to $|T_\alpha|$, the sizes of the pruned original tree. The error flattens out at around 17 terminal nodes, giving the pruned tree in Figure 9.5.
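The book's analysis used a CART-style implementation whose code is not shown. As a rough modern analogue (and only an approximation: scikit-learn grows and prunes with the same impurity criterion rather than pruning by misclassification rate), cost-complexity pruning exposes the sequence of α values directly, so cross-validation can be indexed by α as described above. This sketch assumes the spam feature matrix X and 0/1 labels y are already loaded:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Grow a large tree with an entropy (deviance-like) criterion, then obtain the
# sequence of cost-complexity alpha values for its nested pruned subtrees.
base = DecisionTreeClassifier(criterion="entropy", random_state=0)
path = base.cost_complexity_pruning_path(X, y)

cv_error = []
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha,
                                 random_state=0)
    # 10-fold cross-validation error, indexed by alpha rather than tree size.
    cv_error.append(1.0 - cross_val_score(clf, X, y, cv=10).mean())

best_alpha = path.ccp_alphas[int(np.argmin(cv_error))]
final_tree = DecisionTreeClassifier(criterion="entropy",
                                    ccp_alpha=best_alpha,
                                    random_state=0).fit(X, y)
print(final_tree.get_n_leaves(), "terminal nodes at the CV-chosen alpha")
```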