The variable names are written on the vertical axis.

FIGURE 10.7. Partial dependence of log-odds of spam on four important predictors (!, remove, edu, hp). The red ticks at the base of the plots are deciles of the input variable.

FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a function of joint frequencies of hp and the character !.

Thus a tree can be formally expressed as
\[
T(x; \Theta) = \sum_{j=1}^{J} \gamma_j \, I(x \in R_j), \tag{10.25}
\]
with parameters $\Theta = \{R_j, \gamma_j\}_1^J$. $J$ is usually treated as a meta-parameter. The parameters are found by minimizing the empirical risk
\[
\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j). \tag{10.26}
\]
This is a formidable combinatorial optimization problem, and we usually settle for approximate suboptimal solutions.
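To make the notation in (10.25) and (10.26) concrete, here is a small Python sketch (mine, not from the book) in which each region R_j is represented by a boolean membership function and the empirical risk is evaluated for a fixed set of parameters:

```python
import numpy as np

def tree_predict(x, regions, gammas):
    """Evaluate T(x; Theta) = sum_j gamma_j * I(x in R_j), as in (10.25).

    regions : list of functions, regions[j](x) -> bool, forming a disjoint partition
    gammas  : list of constants gamma_j, one per region
    """
    for in_region, gamma in zip(regions, gammas):
        if in_region(x):
            return gamma
    raise ValueError("x falls in no region; the R_j should partition the input space")

def empirical_risk(X, y, regions, gammas, loss):
    """Empirical risk of (10.26): sum over regions of the losses L(y_i, gamma_j)."""
    total = 0.0
    for in_region, gamma in zip(regions, gammas):
        in_j = np.array([in_region(x) for x in X])
        total += sum(loss(yi, gamma) for yi in y[in_j])
    return total

# Toy illustration: two half-line regions on a one-dimensional input.
regions = [lambda x: x < 0.5, lambda x: x >= 0.5]
gammas = [-1.0, 2.0]
X = np.array([0.1, 0.4, 0.7, 0.9])
y = np.array([-0.8, -1.2, 2.3, 1.9])
sq_loss = lambda yi, g: (yi - g) ** 2
print(tree_predict(0.3, regions, gammas))              # -1.0
print(empirical_risk(X, y, regions, gammas, sq_loss))  # risk of this fixed tree
```

The hard combinatorial part is, of course, choosing the regions themselves; the sketch only evaluates a tree once its parameters are given.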
It is useful to divide the optimization problem into two parts:

Finding $\gamma_j$ given $R_j$: Given the $R_j$, estimating the $\gamma_j$ is typically trivial, and often $\hat{\gamma}_j = \bar{y}_j$, the mean of the $y_i$ falling in region $R_j$. For misclassification loss, $\hat{\gamma}_j$ is the modal class of the observations falling in region $R_j$.

Finding $R_j$: This is the difficult part, for which approximate solutions are found. Note also that finding the $R_j$ entails estimating the $\gamma_j$ as well. A typical strategy is to use a greedy, top-down recursive partitioning algorithm to find the $R_j$.
In addition, it is sometimes necessary to approximate (10.26) by a smoother and more convenient criterion for optimizing the $R_j$:
\[
\tilde{\Theta} = \arg\min_{\Theta} \sum_{i=1}^{N} \tilde{L}(y_i, T(x_i, \Theta)). \tag{10.27}
\]
Then given the $\hat{R}_j = \tilde{R}_j$, the $\gamma_j$ can be estimated more precisely using the original criterion.

In Section 9.2 we described such a strategy for classification trees. The Gini index replaced misclassification loss in the growing of the tree (identifying the $R_j$).

The boosted tree model is a sum of such trees,
\[
f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m), \tag{10.28}
\]
induced in a forward stagewise manner (Algorithm 10.2). At each step in the forward stagewise procedure one must solve
\[
\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\right) \tag{10.29}
\]
for the region set and constants $\Theta_m = \{R_{jm}, \gamma_{jm}\}_1^{J_m}$ of the next tree, given the current model $f_{m-1}(x)$.

Given the regions $R_{jm}$, finding the optimal constants $\gamma_{jm}$ in each region is typically straightforward:
\[
\hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L\left(y_i, f_{m-1}(x_i) + \gamma_{jm}\right). \tag{10.30}
\]
Finding the regions is difficult, and even more difficult than for a single tree.
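The forward stagewise recipe (10.28)-(10.30) can be sketched in a few lines of Python. This is an illustrative outline, not the book's algorithm: the routine that proposes the regions R_jm is assumed (a toy stump finder stands in for greedy recursive partitioning), and the per-region constants are obtained by a generic scalar minimization rather than any closed form.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def boost_trees(X, y, loss, fit_regions, M, J):
    """Forward stagewise sum of trees, in the spirit of (10.28)-(10.30).

    loss        : vectorized L(y, f) returning an array of losses
    fit_regions : assumed helper proposing a partition of the data into J regions
                  (returns a list of boolean index arrays)
    """
    f = np.zeros(len(y))                               # f_0 = 0 for simplicity
    trees = []
    for m in range(M):
        regions = fit_regions(X, y, f, J)              # approximate the hard part (10.29)
        gammas = []
        for idx in regions:                            # per-region constant, cf. (10.30)
            obj = lambda g, idx=idx: loss(y[idx], f[idx] + g).sum()
            gammas.append(minimize_scalar(obj).x)
        trees.append((regions, gammas))
        for idx, g in zip(regions, gammas):            # f_m = f_{m-1} + T(.; Theta_m)
            f[idx] += g
    return trees, f

# Toy "region finder": a single split (stump) on feature 0, scored on the
# current residuals; a stand-in for greedy top-down recursive partitioning.
def stump_regions(X, y, f, J=2):
    r = y - f
    best, best_score = None, np.inf
    for t in np.quantile(X[:, 0], [0.25, 0.5, 0.75]):
        left = X[:, 0] <= t
        if left.all() or not left.any():
            continue
        score = r[left].var() * left.sum() + r[~left].var() * (~left).sum()
        if score < best_score:
            best, best_score = (left, ~left), score
    return list(best)

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=100)
sq = lambda yy, ff: (yy - ff) ** 2
trees, fit = boost_trees(X, y, sq, stump_regions, M=50, J=2)
print("training MSE:", sq(y, fit).mean())
```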
For a few special cases, the problem simplifies.

For squared-error loss, the solution to (10.29) is no harder than for a single tree. It is simply the regression tree that best predicts the current residuals $y_i - f_{m-1}(x_i)$, and $\hat{\gamma}_{jm}$ is the mean of these residuals in each corresponding region.

For two-class classification and exponential loss, this stagewise approach gives rise to the AdaBoost method for boosting classification trees (Algorithm 10.1). In particular, if the trees $T(x; \Theta_m)$ are restricted to be scaled classification trees, then we showed in Section 10.4 that the solution to (10.29) is the tree that minimizes the weighted error rate $\sum_{i=1}^{N} w_i^{(m)} I(y_i \neq T(x_i; \Theta_m))$ with weights $w_i^{(m)} = e^{-y_i f_{m-1}(x_i)}$. By a scaled classification tree, we mean $\beta_m T(x; \Theta_m)$, with the restriction that $\gamma_{jm} \in \{-1, 1\}$.

Without this restriction, (10.29) still simplifies for exponential loss to a weighted exponential criterion for the new tree:
\[
\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} w_i^{(m)} \exp[-y_i T(x_i; \Theta_m)]. \tag{10.31}
\]
It is straightforward to implement a greedy recursive-partitioning algorithm using this weighted exponential loss as a splitting criterion.
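Before turning to the leaf estimates for the exponential case, here is a minimal sketch of one stagewise step for the squared-error case just described, assuming scikit-learn's DecisionTreeRegressor as the base learner (an illustrative assumption; the book prescribes no particular implementation). Because the tree is grown on the current residuals with squared-error splitting, its fitted leaf values are already the region means called for by (10.30).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stagewise_step_squared_error(X, y, f_prev, max_leaf_nodes=8):
    """One step of (10.29) under squared-error loss: fit a regression tree to the
    current residuals; its leaf means are the gamma_jm of (10.30)."""
    residuals = y - f_prev                      # y_i - f_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
    tree.fit(X, residuals)
    return f_prev + tree.predict(X), tree       # updated fit f_m on the training set
```

Iterating this step M times gives least-squares boosting of regression trees.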
Given the $R_{jm}$, one can show (Exercise 10.7) that the solution to (10.30) is the weighted log-odds in each corresponding region
\[
\hat{\gamma}_{jm} = \frac{1}{2} \log \frac{\sum_{x_i \in R_{jm}} w_i^{(m)} I(y_i = 1)}{\sum_{x_i \in R_{jm}} w_i^{(m)} I(y_i = -1)}. \tag{10.32}
\]
This requires a specialized tree-growing algorithm; in practice, we prefer the approximation presented below that uses a weighted least squares regression tree.

Using loss criteria such as the absolute error or the Huber loss (10.23) in place of squared-error loss for regression, and the deviance (10.22) in place of exponential loss for classification, will serve to robustify boosting trees. Unfortunately, unlike their nonrobust counterparts, these robust criteria do not give rise to simple fast boosting algorithms.

For more general loss criteria the solution to (10.30), given the $R_{jm}$, is typically straightforward since it is a simple "location" estimate. For absolute loss it is just the median of the residuals in each respective region. For the other criteria fast iterative algorithms exist for solving (10.30), and usually their faster "single-step" approximations are adequate.
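Once the regions are fixed, these per-region "location" estimates are indeed simple to compute. A small numpy sketch, under the assumption that region membership is supplied as a boolean index array:

```python
import numpy as np

def gamma_exponential(y, f_prev, region):
    """Weighted log-odds (10.32) for one region under exponential loss, y in {-1, +1}.
    (In practice one would guard against a region containing only one class.)"""
    w = np.exp(-y * f_prev)                     # w_i^{(m)} = exp(-y_i f_{m-1}(x_i))
    num = w[region & (y == 1)].sum()
    den = w[region & (y == -1)].sum()
    return 0.5 * np.log(num / den)

def gamma_absolute(y, f_prev, region):
    """For absolute loss, the per-region solution to (10.30) is the median of the
    current residuals in that region."""
    return np.median((y - f_prev)[region])
```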
The problem is tree induction. Simple fast algorithms do not exist for solving (10.29) for these more general loss criteria, and approximations like (10.27) become essential.

10.10 Numerical Optimization via Gradient Boosting

Fast approximate algorithms for solving (10.29) with any differentiable loss criterion can be derived by analogy to numerical optimization. The loss in using $f(x)$ to predict $y$ on the training data is
\[
L(f) = \sum_{i=1}^{N} L(y_i, f(x_i)). \tag{10.33}
\]
The goal is to minimize $L(f)$ with respect to $f$, where here $f(x)$ is constrained to be a sum of trees (10.28).
Ignoring this constraint, minimizing (10.33) can be viewed as a numerical optimization
\[
\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}), \tag{10.34}
\]
where the "parameters" $\mathbf{f} \in \mathbb{R}^N$ are the values of the approximating function $f(x_i)$ at each of the $N$ data points $x_i$:
\[
\mathbf{f} = \{f(x_1), f(x_2), \ldots, f(x_N)\}^T.
\]
Numerical optimization procedures solve (10.34) as a sum of component vectors
\[
\mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m, \quad \mathbf{h}_m \in \mathbb{R}^N,
\]
where $\mathbf{f}_0 = \mathbf{h}_0$ is an initial guess, and each successive $\mathbf{f}_m$ is induced based on the current parameter vector $\mathbf{f}_{m-1}$, which is the sum of the previously induced updates. Numerical optimization methods differ in their prescriptions for computing each increment vector $\mathbf{h}_m$ ("step").

10.10.1 Steepest Descent

Steepest descent chooses $\mathbf{h}_m = -\rho_m \mathbf{g}_m$, where $\rho_m$ is a scalar and $\mathbf{g}_m \in \mathbb{R}^N$ is the gradient of $L(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$.
The components of the gradient $\mathbf{g}_m$ are
\[
g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}. \tag{10.35}
\]
The step length $\rho_m$ is the solution to
\[
\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m). \tag{10.36}
\]
The current solution is then updated
\[
\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m
\]
and the process repeated at the next iteration. Steepest descent can be viewed as a very greedy strategy, since $-\mathbf{g}_m$ is the local direction in $\mathbb{R}^N$ for which $L(\mathbf{f})$ is most rapidly decreasing at $\mathbf{f} = \mathbf{f}_{m-1}$.

10.10.2 Gradient Boosting

Forward stagewise boosting (Algorithm 10.2) is also a very greedy strategy. At each step the solution tree is the one that maximally reduces (10.29), given the current model $f_{m-1}$ and its fits $f_{m-1}(x_i)$. Thus, the tree predictions $T(x_i; \Theta_m)$ are analogous to the components of the negative gradient (10.35).
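To illustrate the unconstrained side of this analogy, here is a small sketch of steepest descent over the fitted values, implementing (10.35)-(10.36) with a numerical line search; the exponential loss is my choice of working example, not something the text specifies here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent_in_function_space(y, M=100):
    """Unconstrained minimization of L(f) = sum_i exp(-y_i f_i) over f in R^N,
    via the updates (10.35)-(10.36); y_i in {-1, +1}."""
    loss = lambda f: np.exp(-y * f).sum()                 # (10.33) with exponential loss
    f = np.zeros_like(y, dtype=float)                     # initial guess f_0
    for m in range(M):
        g = -y * np.exp(-y * f)                           # gradient components (10.35)
        rho = minimize_scalar(lambda r: loss(f - r * g),  # line search (10.36)
                              bounds=(0.0, 10.0), method="bounded").x
        f = f - rho * g                                   # update f_m
    return f

y = np.array([1, 1, -1, 1, -1], dtype=float)
f_hat = steepest_descent_in_function_space(y)
print(np.sign(f_hat))   # each fitted value is driven toward the sign of y_i
```

Since this loss can be driven toward zero by pushing each fitted value toward the sign of its own $y_i$, the sketch mainly shows the mechanics; the discussion that follows explains what changes when the step is constrained to be a tree.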
The principal difference between them is that the tree components $\mathbf{t}_m = \{T(x_1; \Theta_m), \ldots, T(x_N; \Theta_m)\}^T$ are not independent. They are constrained to be the predictions of a $J_m$-terminal node decision tree, whereas the negative gradient is the unconstrained maximal descent direction.

The solution to (10.30) in the stagewise approach is analogous to the line search (10.36) in steepest descent.
The difference is that (10.30) performs a separate line search for those components of $\mathbf{t}_m$ that correspond to each separate terminal region $\{T(x_i; \Theta_m)\}_{x_i \in R_{jm}}$.

If minimizing loss on the training data (10.33) were the only goal, steepest descent would be the preferred strategy. The gradient (10.35) is trivial to calculate for any differentiable loss function $L(y, f(x))$, whereas solving (10.29) is difficult for the robust criteria discussed in Section 10.6. Unfortunately the gradient (10.35) is defined only at the training data points $x_i$, whereas the ultimate goal is to generalize $f_M(x)$ to new data not represented in the training set.

A possible resolution to this dilemma is to induce a tree $T(x; \Theta_m)$ at the $m$th iteration whose predictions $\mathbf{t}_m$ are as close as possible to the negative gradient.
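A minimal sketch of this resolution, assuming scikit-learn's DecisionTreeRegressor as the base learner and the exponential loss as the working example (both assumptions for illustration, not the book's prescription): at each iteration the negative gradient is evaluated at the training points, a small regression tree is fit to it, and each terminal region then receives its own constant by a separate line search in the spirit of (10.30).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, max_leaf_nodes=4):
    """Boost trees fit to the negative gradient of the exponential loss, y in {-1, +1}."""
    loss = lambda yy, ff: np.exp(-yy * ff)
    f = np.zeros(len(y))
    ensemble = []
    for m in range(M):
        neg_grad = y * np.exp(-y * f)              # -dL/df at f_{m-1}: the "pseudo-residuals"
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, neg_grad)                      # tree whose predictions track -g_m
        leaf = tree.apply(X)                       # terminal region of each training point
        gammas = {}
        for j in np.unique(leaf):                  # separate line search per region, cf. (10.30)
            idx = leaf == j
            gammas[j] = minimize_scalar(
                lambda g, idx=idx: loss(y[idx], f[idx] + g).sum(),
                bounds=(-4.0, 4.0), method="bounded").x   # bounded so pure regions cannot diverge
        f += np.array([gammas[j] for j in leaf])   # update the fit region by region
        ensemble.append((tree, gammas))
    return ensemble, f

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
ensemble, f_train = gradient_boost(X, y, M=20)
print("training error:", np.mean(np.sign(f_train) != y))
```

Predictions at a new point x are obtained by passing x down each stored tree and adding the constant of the region it lands in, which is what lets the fitted model generalize beyond the training points.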