Using squared error to measure closeness, this leads us to

    \tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \bigl( -g_{im} - T(x_i;\Theta) \bigr)^2 .    (10.37)

That is, one fits the tree T to the negative gradient values (10.35) by least squares. As noted in Section 10.9, fast algorithms exist for least-squares decision tree induction. Although the solution regions \tilde{R}_{jm} to (10.37) will not be identical to the regions R_{jm} that solve (10.29), they are generally similar enough to serve the same purpose. In any case, the forward stagewise boosting procedure, and top-down decision tree induction, are themselves approximation procedures. After constructing the tree (10.37), the corresponding constants in each region are given by (10.30).

TABLE 10.2. Gradients for commonly used loss functions.

Setting          Loss Function              -∂L(y_i, f(x_i)) / ∂f(x_i)
Regression       (1/2)[y_i - f(x_i)]^2      y_i - f(x_i)
Regression       |y_i - f(x_i)|             sign[y_i - f(x_i)]
Regression       Huber                      y_i - f(x_i)              for |y_i - f(x_i)| <= δ_m
                                            δ_m sign[y_i - f(x_i)]    for |y_i - f(x_i)| > δ_m,
                                            where δ_m = αth-quantile{|y_i - f(x_i)|}
Classification   Deviance                   kth component: I(y_i = G_k) - p_k(x_i)

Table 10.2 summarizes the gradients for commonly used loss functions. For squared error loss, the negative gradient is just the ordinary residual -g_{im} = y_i - f_{m-1}(x_i), so that (10.37) on its own is equivalent to standard least-squares boosting.
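To make the regression rows of Table 10.2 concrete, here is a minimal sketch (not from the text) of how the pseudo-residuals fed to the tree in (10.37) might be computed; it assumes NumPy, and the function name negative_gradient is our own. The absolute-error and Huber rows are discussed in turn below.

import numpy as np

def negative_gradient(y, f, loss="squared", alpha=0.9):
    """Pseudo-residuals -dL(y_i, f(x_i))/df(x_i) for the regression rows of Table 10.2."""
    r = y - f                                    # current residuals y_i - f(x_i)
    if loss == "squared":                        # L = (1/2)[y - f]^2
        return r
    if loss == "absolute":                       # L = |y - f|
        return np.sign(r)
    if loss == "huber":                          # Huber M-regression
        delta = np.quantile(np.abs(r), alpha)    # delta_m = alpha-th quantile of |residuals|
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
    raise ValueError(f"unknown loss: {loss}")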
With absolute error loss, the negative gradient is the sign of the residual, so at each iteration (10.37) fits the tree to the sign of the current residuals by least squares. For Huber M-regression, the negative gradient is a compromise between these two (see the table).

For classification the loss function is the multinomial deviance (10.22), and K least-squares trees are constructed at each iteration. Each tree T_{km} is fit to its respective negative gradient vector g_{km},

    -g_{ikm} = \frac{\partial L\bigl(y_i, f_{1m}(x_i), \ldots, f_{Km}(x_i)\bigr)}{\partial f_{km}(x_i)} = I(y_i = G_k) - p_k(x_i),    (10.38)

with p_k(x) given by (10.21). Although K separate trees are built at each iteration, they are related through (10.21). For binary classification (K = 2), only one tree is needed (Exercise 10.10).

10.10.3 Implementations of Gradient Boosting

Algorithm 10.3 presents the generic gradient tree-boosting algorithm for regression. Specific algorithms are obtained by inserting different loss criteria L(y, f(x)).
The first line of the algorithm initializes to the optimal constant model, which is just a single terminal node tree. The components of the negative gradient computed at line 2(a) are referred to as generalized or pseudo residuals, r. Gradients for commonly used loss functions are summarized in Table 10.2.

Algorithm 10.3 Gradient Tree Boosting Algorithm.

1. Initialize f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma).

2. For m = 1 to M:

   (a) For i = 1, 2, ..., N compute

       r_{im} = - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}} .

   (b) Fit a regression tree to the targets r_{im}, giving terminal regions R_{jm}, j = 1, 2, ..., J_m.

   (c) For j = 1, 2, ..., J_m compute

       \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma) .

   (d) Update f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm}).

3. Output \hat{f}(x) = f_M(x).
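Algorithm 10.3 can be sketched in a few lines of Python for squared-error loss. This is our own illustration, not the MART implementation; it assumes scikit-learn's DecisionTreeRegressor as the J-terminal-node base learner, and the helper names gradient_tree_boost and boost_predict are hypothetical. For squared-error loss, step 2(c) reduces to the mean of the residuals in each region, which is exactly the leaf value the least-squares tree has already stored, so steps 2(c) and 2(d) collapse into adding the tree's predictions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, M=100, J=6):
    """Sketch of Algorithm 10.3 with squared-error loss L(y, f) = (1/2)(y - f)^2."""
    f0 = y.mean()                              # step 1: arg min_gamma of sum (y_i - gamma)^2
    f = np.full(len(y), f0, dtype=float)
    trees = []
    for m in range(M):                         # step 2
        r = y - f                              # 2(a): pseudo-residuals (Table 10.2)
        tree = DecisionTreeRegressor(max_leaf_nodes=J)
        tree.fit(X, r)                         # 2(b): J-terminal-node least-squares tree
        # 2(c)/2(d): for squared error, gamma_jm is the mean residual in R_jm,
        # which the fitted least-squares tree already returns as its leaf value.
        f = f + tree.predict(X)
        trees.append(tree)
    return f0, trees                           # step 3: f_M(x) = f0 + sum_m T_m(x)

def boost_predict(f0, trees, X):
    """Evaluate f_M at new points X."""
    return f0 + np.sum([t.predict(X) for t in trees], axis=0)

Shrinkage and subsampling, introduced under regularization in Section 10.12, would scale or randomize each tree's contribution; they are omitted here for clarity.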
The algorithm for classification is similar. Lines 2(a)–(d) are repeated K times at each iteration m, once for each class, using (10.38). The result at line 3 is K different (coupled) tree expansions f_{kM}(x), k = 1, 2, ..., K. These produce probabilities via (10.21) or do classification as in (10.20). Details are given in Exercise 10.9. Two basic tuning parameters are the number of iterations M and the sizes of each of the constituent trees J_m, m = 1, 2, ..., M.

The original implementation of this algorithm was called MART for “multiple additive regression trees,” and was referred to in the first edition of this book.
Many of the figures in this chapter were produced by MART. Gradient boosting as described here is implemented in the R gbm package (Ridgeway, 1999, “Gradient Boosted Models”), and is freely available. The gbm package is used in Section 10.14.2, and extensively in Chapters 16 and 15. Another R implementation of boosting is mboost (Hothorn and Bühlmann, 2006). A commercial implementation of gradient boosting/MART called TreeNet is available from Salford Systems, Inc.

10.11 Right-Sized Trees for Boosting

Historically, boosting was considered to be a technique for combining models, here trees.
As such, the tree-building algorithm was regarded as a primitive that produced models to be combined by the boosting procedure. In this scenario, the optimal size of each tree is estimated separately in the usual manner when it is built (Section 9.2). A very large (oversized) tree is first induced, and then a bottom-up procedure is employed to prune it to the estimated optimal number of terminal nodes.
This approach assumes implicitly that each tree is the last one in the expansion (10.28). Except perhaps for the very last tree, this is clearly a very poor assumption. The result is that trees tend to be much too large, especially during the early iterations. This substantially degrades performance and increases computation.

The simplest strategy for avoiding this problem is to restrict all trees to be the same size, J_m = J for all m. At each iteration a J-terminal node regression tree is induced.
Thus J becomes a meta-parameter of the entire boosting procedure, to be adjusted to maximize estimated performance for the data at hand.

One can get an idea of useful values for J by considering the properties of the “target” function

    \eta = \arg\min_{f} \mathrm{E}_{XY} \, L(Y, f(X)).    (10.39)

Here the expected value is over the population joint distribution of (X, Y). The target function η(x) is the one with minimum prediction risk on future data. This is the function we are trying to approximate.

One relevant property of η(X) is the degree to which the coordinate variables X^T = (X_1, X_2, ..., X_p) interact with one another. This is captured by its ANOVA (analysis of variance) expansion

    \eta(X) = \sum_{j} \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl}(X_j, X_k, X_l) + \cdots .    (10.40)

The first sum in (10.40) is over functions of only a single predictor variable X_j. The particular functions η_j(X_j) are those that jointly best approximate η(X) under the loss criterion being used. Each such η_j(X_j) is called the “main effect” of X_j. The second sum is over those two-variable functions that, when added to the main effects, best fit η(X). These are called the second-order interactions of each respective variable pair (X_j, X_k). The third sum represents third-order interactions, and so on.
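As a toy illustration (our own example, not from the text, and ignoring the centering constants needed for a formal ANOVA decomposition), suppose the target is

    \eta(X) = 3X_1 + X_2^2 + X_1 X_2 .

Its main effects are η_1(X_1) = 3X_1 and η_2(X_2) = X_2^2, it has a single second-order interaction η_{12}(X_1, X_2) = X_1 X_2, and all higher-order terms vanish. Its dominant interaction level is therefore two: anticipating the discussion below, boosted trees with J ≥ 3 terminal nodes could in principle capture it, while stumps (J = 2) could not represent the X_1 X_2 term.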
For many problems encountered in practice, low-order interaction effects tend to dominate. When this is the case, models that produce strong higher-order interaction effects, such as large decision trees, suffer in accuracy.

The interaction level of tree-based approximations is limited by the tree size J. Namely, no interaction effects of level greater than J − 1 are possible. Since boosted models are additive in the trees (10.28), this limit extends to them as well. Setting J = 2 (single-split “decision stump”) produces boosted models with only main effects; no interactions are permitted.
With J = 3, two-variable interaction effects are also allowed, and so on. This suggests that the value chosen for J should reflect the level of dominant interactions of η(x).

FIGURE 10.9. Boosting with different sized trees (stumps, 10-node, and 100-node trees, with AdaBoost for comparison; test error versus number of terms), applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.
This is of course generally unknown, but in most situations it will tend to be low. Figure 10.9 illustrates the effect of interaction order (choice of J) on the simulation example (10.2). The generative function is additive (sum of quadratic monomials), so boosting models with J > 2 incurs unnecessary variance and hence the higher test error.
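For readers who want to reproduce the flavor of Figure 10.9 without MART, here is a rough sketch (our own, not the code behind the figure) using scikit-learn's GradientBoostingClassifier, whose max_leaf_nodes parameter plays the role of J. The data follow example (10.2): ten standard Gaussian features, with Y = 1 when their sum of squares exceeds 9.34, roughly the median of the chi-squared distribution with 10 degrees of freedom.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

def make_data(n):
    # Example (10.2): X ~ N(0, I_10); y = 1 if sum_j X_j^2 > 9.34, else 0
    X = rng.standard_normal((n, 10))
    y = (np.sum(X ** 2, axis=1) > 9.34).astype(int)
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(10000)

# J-terminal-node trees: J = 2 (stumps), 10, and 100, echoing Figure 10.9
for J in (2, 10, 100):
    gbm = GradientBoostingClassifier(n_estimators=400, max_leaf_nodes=J)
    gbm.fit(X_train, y_train)
    print(f"J = {J:3d}   test error = {1.0 - gbm.score(X_test, y_test):.3f}")

Since the generative function is additive, the stumps (J = 2) should come out best here, as in the figure.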
Figure 10.10 compares the coordinate functions found by boosted stumps with the true functions.

Although in many applications J = 2 will be insufficient, it is unlikely that J > 10 will be required. Experience so far indicates that 4 ≤ J ≤ 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range. One can fine-tune the value for J by trying several different values and choosing the one that produces the lowest risk on a validation sample.
However, this seldom provides significant improvement over using J ≃ 6.

FIGURE 10.10. Coordinate functions for additive logistic trees: the functions f_1(x_1) through f_10(x_10) estimated by boosting stumps for the simulated example used in Figure 10.9. The true quadratic functions are shown for comparison.

10.12 Regularization

Besides the size of the constituent trees, J, the other meta-parameter of gradient boosting is the number of boosting iterations M.
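Anticipating that discussion, here is a hedged sketch (our own, reusing the data and imports from the previous sketch) of how M might be treated as a tuning parameter: scikit-learn's staged_predict exposes the fit after each boosting iteration, so prediction error can be monitored as a function of the number of trees, with the held-out set above standing in for a validation sample, and M chosen where that error is smallest.

# Continuing the previous sketch: choose M by monitoring error over the
# staged fits f_1, ..., f_M (staged_predict yields one prediction per iteration).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=400, max_leaf_nodes=6)
gbm.fit(X_train, y_train)                      # X_train, y_train from the sketch above

errors = np.array([np.mean(pred != y_test) for pred in gbm.staged_predict(X_test)])
best_M = int(np.argmin(errors)) + 1            # iterations are 1-indexed
print(f"chosen M = {best_M}, error at M = {errors[best_M - 1]:.3f}")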