Each iteration usually reduces the training risk L(f_M), so that for M large enough this risk can be made arbitrarily small. However, fitting the training data too well can lead to overfitting, which degrades the risk on future predictions. Thus, there is an optimal number M∗ minimizing future risk that is application dependent. A convenient way to estimate M∗ is to monitor prediction risk as a function of M on a validation sample. The value of M that minimizes this risk is taken to be an estimate of M∗. This is analogous to the early stopping strategy often used with neural networks (Section 11.4).

10.12.1 Shrinkage

Controlling the value of M is not the only possible regularization strategy. As with ridge regression and neural networks, shrinkage techniques can be employed as well (see Sections 3.4.1 and 11.5). The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor 0 < ν < 1 when it is added to the current approximation. That is, line 2(d) of Algorithm 10.3 is replaced by

f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}).    (10.41)

The parameter ν can be regarded as controlling the learning rate of the boosting procedure.
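To make the shrunken update concrete, here is a minimal sketch of a forward-stagewise boosting loop for squared-error loss, with each tree's contribution scaled by ν as in (10.41). It uses scikit-learn regression trees for convenience; the function names, default values, and the choice of squared-error loss are illustrative, not a reference implementation of the book's algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_with_shrinkage(X, y, M=500, nu=0.1, max_leaf_nodes=6):
    """Toy forward-stagewise boosting for squared-error loss: each tree is
    fit to the current residuals and added with its contribution scaled by
    the shrinkage factor nu, as in the modified line 2(d), equation (10.41)."""
    f0 = y.mean()
    f = np.full(len(y), f0)            # initial constant fit
    trees = []
    for m in range(M):
        residual = y - f               # negative gradient for squared-error loss
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residual)
        f = f + nu * tree.predict(X)   # shrunken update (10.41)
        trees.append(tree)
    return f0, trees

def predict_boosted(f0, trees, X, nu=0.1):
    """Evaluate f_M(x) = f_0 + nu * sum_m T_m(x)."""
    return f0 + nu * np.sum([tree.predict(X) for tree in trees], axis=0)
```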
Smaller values of ν (more shrinkage) result in larger training risk for the same number of iterations M. Thus, both ν and M control prediction risk on the training data. However, these parameters do not operate independently.
Smaller values of ν lead to larger values of M for the same training risk, so that there is a tradeoff between them. Empirically it has been found (Friedman, 2001) that smaller values of ν favor better test error, and require correspondingly larger values of M. In fact, the best strategy appears to be to set ν to be very small (ν < 0.1) and then choose M by early stopping. This yields dramatic improvements (over no shrinkage, ν = 1) for regression and for probability estimation.
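A hypothetical sketch of this small-ν, early-stopping strategy using scikit-learn's gradient boosting: fit with a small learning rate and a deliberately generous number of trees, then pick the iteration that minimizes a loss tracked on a validation sample. The data and all parameter values below are illustrative placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Illustrative data standing in for a real training/validation split.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

# Small learning rate (nu) and a generous upper bound on the number of trees (M).
gbm = GradientBoostingClassifier(learning_rate=0.05, n_estimators=1000,
                                 max_depth=1)        # stumps
gbm.fit(X_train, y_train)

# Validation log-loss (proportional to the binomial deviance) after each
# iteration m; its minimizer is the early-stopping estimate of M*.
val_loss = [log_loss(y_val, p) for p in gbm.staged_predict_proba(X_val)]
M_star = int(np.argmin(val_loss)) + 1
```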
The corresponding improvements in misclassification risk via (10.20) are less, but still substantial. The price paid for these improvements is computational: smaller values of ν give rise to larger values of M, and computation is proportional to the latter. However, as seen below, many iterations are generally computationally feasible even on very large data sets.
This is partly due to the fact that small trees are induced at each step with no pruning.

Figure 10.11 shows test error curves for the simulated example (10.2) of Figure 10.2. A gradient boosted model (MART) was trained using binomial deviance, using either stumps or six terminal-node trees, and with or without shrinkage. The benefits of shrinkage are evident, especially when the binomial deviance is tracked. With shrinkage, each test error curve reaches a lower value, and stays there for many iterations.

Section 16.2.1 draws a connection between forward stagewise shrinkage in boosting and the use of an L1 penalty for regularizing model parameters (the "lasso").
We argue that L1 penalties may be superior to the L2 penalties used by methods such as the support vector machine.

10.12.2 Subsampling

We saw in Section 8.7 that bootstrap averaging (bagging) improves the performance of a noisy classifier through averaging. Chapter 15 discusses in some detail the variance-reduction mechanism of this sampling followed by averaging. We can exploit the same device in gradient boosting, both to improve performance and computational efficiency.

With stochastic gradient boosting (Friedman, 1999), at each iteration we sample a fraction η of the training observations (without replacement), and grow the next tree using that subsample. The rest of the algorithm is identical. A typical value for η can be 1/2, although for large N, η can be substantially smaller than 1/2.

Not only does the sampling reduce the computing time by the same fraction η, but in many cases it actually produces a more accurate model.
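Continuing the toy squared-error loop sketched earlier, the only change stochastic gradient boosting requires is that each tree be grown on a freshly drawn subsample of a fraction η of the observations; the names and defaults below are again illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_boost(X, y, M=500, nu=0.1, eta=0.5, max_leaf_nodes=6, seed=0):
    """Stochastic gradient boosting for squared-error loss: each tree is
    grown on a fresh subsample of a fraction eta of the observations,
    drawn without replacement; the update itself is unchanged."""
    rng = np.random.default_rng(seed)
    N = len(y)
    f0 = y.mean()
    f = np.full(N, f0)
    trees = []
    for m in range(M):
        idx = rng.choice(N, size=int(eta * N), replace=False)
        residual = y[idx] - f[idx]            # negative gradient on the subsample
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X[idx], residual)
        f = f + nu * tree.predict(X)          # shrunken update on all observations
        trees.append(tree)
    return f0, trees
```

In scikit-learn the same effect can be obtained by setting the subsample argument (for example, subsample=0.5) of its gradient boosting estimators.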
Figure 10.12 illustrates the effect of subsampling using the simulated example (10.2), both as a classification and as a regression example. We see in both cases that sampling along with shrinkage slightly outperformed the rest. It appears here that subsampling without shrinkage does poorly.

FIGURE 10.11. Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, either stumps or six terminal-node trees, and with or without shrinkage. The left panels report test deviance, while the right panels show misclassification error. The beneficial effect of shrinkage can be seen in all cases, especially for deviance in the left panels.

FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity.
For the curves labeled "Sample = 0.5", a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right-hand panel using square-error loss.

The downside is that we now have four parameters to set: J, M, ν and η. Typically some early explorations determine suitable values for J, ν and η, leaving M as the primary parameter.
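As a purely illustrative mapping of these four parameters onto scikit-learn's gradient boosting interface (the specific values below are placeholders, not recommendations):

```python
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(max_leaf_nodes=6,    # J: terminal nodes per tree
                                n_estimators=1000,   # M: upper bound, chosen by validation
                                learning_rate=0.1,   # nu: shrinkage
                                subsample=0.5)       # eta: subsampling fraction
```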
10.13 Interpretation

Single decision trees are highly interpretable. The entire model can be completely represented by a simple two-dimensional graphic (binary tree) that is easily visualized. Linear combinations of trees (10.28) lose this important feature, and must therefore be interpreted in a different way.

10.13.1 Relative Importance of Predictor Variables

In data mining applications the input predictor variables are seldom equally relevant.
Often only a few of them have substantial influence on the response; the vast majority are irrelevant and could just as well have not been included. It is often useful to learn the relative importance or contribution of each input variable in predicting the response.

For a single decision tree T, Breiman et al. (1984) proposed

I_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I(v(t) = \ell)    (10.42)

as a measure of relevance for each predictor variable X_ℓ. The sum is over the J − 1 internal nodes of the tree.
At each such node t, one of the input variables X_{v(t)} is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement ı̂_t^2 in squared-error risk over that for a constant fit over the entire region. The squared relative importance of variable X_ℓ is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable.

This importance measure is easily generalized to additive tree expansions (10.28); it is simply averaged over the trees

I_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} I_\ell^2(T_m).    (10.43)

Due to the stabilizing effect of averaging, this measure turns out to be more reliable than is its counterpart (10.42) for a single tree.
Also, because of shrinkage (Section 10.12.1) the masking of important variables by others with which they are highly correlated is much less of a problem. Note that (10.42) and (10.43) refer to squared relevance; the actual relevances are their respective square roots. Since these measures are relative, it is customary to assign the largest a value of 100 and then scale the others accordingly.
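A minimal sketch of (10.42)–(10.43) for the list of scikit-learn regression trees produced by the earlier toy boosting loop; it treats each split's decrease in weighted squared-error impurity as the estimated improvement ı̂_t^2, and the helper names are illustrative.

```python
import numpy as np

def squared_importance(tree, p):
    """I_l^2(T) of (10.42): sum the squared improvements over the internal
    nodes at which variable l was chosen as the splitting variable."""
    t = tree.tree_
    imp = np.zeros(p)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                      # leaf: not an internal node
            continue
        # decrease in weighted squared-error impurity produced by this split,
        # used here as the estimated improvement at node t
        gain = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
        imp[t.feature[node]] += gain
    return imp

def relative_importance(trees, p):
    """Average (10.42) over the boosted trees as in (10.43), take square
    roots, and rescale so that the largest importance equals 100."""
    I2 = np.mean([squared_importance(tree, p) for tree in trees], axis=0)
    rel = np.sqrt(I2)
    return 100.0 * rel / rel.max()
```

Fitted scikit-learn ensembles also expose a closely related impurity-based summary directly through their feature_importances_ attribute.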
Figure 10.6 shows the relative importance of the 57 inputs in predicting spam versus email.

For K-class classification, K separate models f_k(x), k = 1, 2, . . . , K are induced, each consisting of a sum of trees

f_k(x) = \sum_{m=1}^{M} T_{km}(x).    (10.44)

In this case (10.43) generalizes to

I_{\ell k}^2 = \frac{1}{M} \sum_{m=1}^{M} I_\ell^2(T_{km}).    (10.45)

Here I_{ℓk} is the relevance of X_ℓ in separating the class k observations from the other classes. The overall relevance of X_ℓ is obtained by averaging over all of the classes

I_\ell^2 = \frac{1}{K} \sum_{k=1}^{K} I_{\ell k}^2.    (10.46)

Figures 10.23 and 10.24 illustrate the use of these averaged and separate relative importances.
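Building on the squared_importance helper above, (10.45) and (10.46) amount to averaging first over each class's trees and then over classes; the sketch below assumes class_trees[k] holds the trees of a hypothetical model f_k.

```python
import numpy as np

def multiclass_importance(class_trees, p):
    """Per-class relevances I_{lk} of (10.45) and the overall relevance of
    (10.46), both returned on the square-root (relevance) scale.
    Uses squared_importance from the previous sketch."""
    I2_per_class = np.array([
        np.mean([squared_importance(tree, p) for tree in trees], axis=0)
        for trees in class_trees])              # (10.45), squared; shape (K, p)
    I2_overall = I2_per_class.mean(axis=0)      # (10.46), squared
    return np.sqrt(I2_per_class), np.sqrt(I2_overall)
```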
10.13.2 Partial Dependence Plots

After the most relevant variables have been identified, the next step is to attempt to understand the nature of the dependence of the approximation f(X) on their joint values. Graphical renderings of the f(X) as a function of its arguments provide a comprehensive summary of its dependence on the joint values of the input variables.

Unfortunately, such visualization is limited to low-dimensional views. We can easily display functions of one or two arguments, either continuous or discrete (or mixed), in a variety of different ways; this book is filled with such displays. Functions of slightly higher dimensions can be plotted by conditioning on particular sets of values of all but one or two of the arguments, producing a trellis of plots (Becker et al., 1996).

For more than two or three variables, viewing functions of the corresponding higher-dimensional arguments is more difficult. A useful alternative can sometimes be to view a collection of plots, each one of which shows the partial dependence of the approximation f(X) on a selected small subset of the input variables.
Although such a collection can seldom provide a comprehensive depiction of the approximation, it can often produce helpful clues, especially when f(x) is dominated by low-order interactions (10.40).

Consider the subvector X_S of ℓ < p of the input predictor variables X^T = (X_1, X_2, . . . , X_p), indexed by S ⊂ {1, 2, . . . , p}.
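One common way to estimate such a partial dependence for a single selected input X_j is to hold X_j fixed at a grid of values and average the fitted model's predictions over the training observations for the remaining inputs. The sketch below is illustrative only (names and grid choice are assumptions) and works for any estimator with a predict method.

```python
import numpy as np

def partial_dependence_1d(model, X, j, grid_size=50):
    """Estimate the partial dependence of model.predict on input j by
    holding column j fixed at each grid value and averaging the
    predictions over the training observations (illustrative sketch)."""
    grid = np.linspace(X[:, j].min(), X[:, j].max(), grid_size)
    pd = np.empty(grid_size)
    for i, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, j] = v                       # hold X_j fixed at v for every observation
        pd[i] = model.predict(Xv).mean()   # average out the remaining inputs
    return grid, pd
```

For fitted scikit-learn estimators, sklearn.inspection.partial_dependence offers a built-in version of this computation.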