The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.

Algorithm 10.1 AdaBoost.M1.

1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. For $m = 1$ to $M$:
   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
   (b) Compute
       $$\mathrm{err}_m = \frac{\sum_{i=1}^N w_i\, I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i}.$$
   (c) Compute $\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$.
   (d) Set $w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, 2, \ldots, N$.
3. Output $G(x) = \mathrm{sign}\left[\sum_{m=1}^M \alpha_m G_m(x)\right]$.

Algorithm 10.1 shows the details of the AdaBoost.M1 algorithm. The current classifier $G_m(x)$ is induced on the weighted observations at line 2a. The resulting weighted error rate is computed at line 2b. Line 2c calculates the weight $\alpha_m$ given to $G_m(x)$ in producing the final classifier $G(x)$ (line 3). The individual weights of each of the observations are updated for the next iteration at line 2d. Observations misclassified by $G_m(x)$ have their weights scaled by a factor $\exp(\alpha_m)$, increasing their relative influence for inducing the next classifier $G_{m+1}(x)$ in the sequence.
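To make steps 2a–2d concrete, here is a minimal Python sketch of Algorithm 10.1, assuming scikit-learn is available so that a depth-one classification tree (a stump) can serve as the weak learner; the function name adaboost_m1 and the small numerical guard on err_m are our additions for illustration, not part of the algorithm as stated.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=400):
    """Sketch of AdaBoost.M1 (Algorithm 10.1); y is coded as -1/+1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                   # 1. uniform observation weights
    learners, alphas = [], []
    for m in range(M):
        # 2(a) fit a classifier to the training data using weights w_i
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y).astype(float)
        # 2(b) weighted error rate err_m
        err = np.sum(w * miss) / np.sum(w)
        err = np.clip(err, 1e-12, 1 - 1e-12)  # guard against log(0); our addition
        # 2(c) classifier weight alpha_m
        alpha = np.log((1.0 - err) / err)
        # 2(d) upweight the misclassified observations
        w = w * np.exp(alpha * miss)
        learners.append(stump)
        alphas.append(alpha)

    def G(Xnew):
        # 3. sign of the weighted committee vote
        votes = sum(a * g.predict(Xnew) for a, g in zip(alphas, learners))
        return np.sign(votes)

    return G
```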
The AdaBoost.M1 algorithm is known as "Discrete AdaBoost" in Friedman et al. (2000), because the base classifier $G_m(x)$ returns a discrete class label. If the base classifier instead returns a real-valued prediction (e.g., a probability mapped to the interval $[-1, 1]$), AdaBoost can be modified appropriately (see "Real AdaBoost" in Friedman et al. (2000)).

The power of AdaBoost to dramatically increase the performance of even a very weak classifier is illustrated in Figure 10.2.
The features $X_1, \ldots, X_{10}$ are standard independent Gaussian, and the deterministic target $Y$ is defined by
$$Y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5), \\ -1 & \text{otherwise.} \end{cases} \qquad (10.2)$$
Here $\chi^2_{10}(0.5) = 9.34$ is the median of a chi-squared random variable with 10 degrees of freedom (sum of squares of 10 standard Gaussians). There are 2000 training cases, with approximately 1000 cases in each class, and 10,000 test observations. Here the weak classifier is just a "stump": a two terminal-node classification tree.
Applying this classifier alone to the training data set yields a very poor test set error rate of 45.8%, compared to 50% for random guessing. However, as boosting iterations proceed the error rate steadily decreases, reaching 5.8% after 400 iterations. Thus, boosting this simple very weak classifier reduces its prediction error rate by almost a factor of four. It also outperforms a single large classification tree (error rate 24.7%).

FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rate for a single stump, and a 244-node classification tree.
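The simulation itself is easy to reproduce. The sketch below draws training and test sets according to (10.2) with NumPy and SciPy and applies boosted stumps to them; it reuses the hypothetical adaboost_m1 sketch above, and the exact error rates obtained will vary around the 45.8%, 24.7% and 5.8% figures quoted in the text because of sampling variability.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

def simulate(n, p=10):
    # X_1, ..., X_10 standard independent Gaussian; Y = 1 if the sum of
    # squares exceeds the chi-squared(10) median, as in (10.2)
    X = rng.standard_normal((n, p))
    y = np.where((X ** 2).sum(axis=1) > chi2.ppf(0.5, p), 1, -1)
    return X, y

X_train, y_train = simulate(2000)
X_test, y_test = simulate(10000)

G = adaboost_m1(X_train, y_train, M=400)      # boosted stumps (sketch above)
print("test error:", np.mean(G(X_test) != y_test))
```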
Since its introduction, much has been written to explain the success of AdaBoost in producing accurate classifiers. Most of this work has centered on using classification trees as the "base learner" $G(x)$, where improvements are often most dramatic. In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the "best off-the-shelf classifier in the world" (see also Breiman (1998)). This is especially the case for data mining applications, as discussed more fully in Section 10.7 later in this chapter.

10.1.1 Outline of This Chapter

Here is an outline of the developments in this chapter:
• We show that AdaBoost fits an additive model in a base learner, optimizing a novel exponential loss function. This loss function is very similar to the (negative) binomial log-likelihood (Sections 10.2–10.4).
• The population minimizer of the exponential loss function is shown to be the log-odds of the class probabilities (Section 10.5).
• We describe loss functions for regression and classification that are more robust than squared error or exponential loss (Section 10.6).
• It is argued that decision trees are an ideal base learner for data mining applications of boosting (Sections 10.7 and 10.9).
• We develop a class of gradient boosted models (GBMs), for boosting trees with any loss function (Section 10.10).
• The importance of "slow learning" is emphasized, and implemented by shrinkage of each new term that enters the model (Section 10.12), as well as randomization (Section 10.12.2).
• Tools for interpretation of the fitted model are described (Section 10.13).

10.2 Boosting Fits an Additive Model

The success of boosting is really not very mysterious.
The key lies in expression (10.1). Boosting is a way of fitting an additive expansion in a set of elementary "basis" functions. Here the basis functions are the individual classifiers $G_m(x) \in \{-1, 1\}$. More generally, basis function expansions take the form
$$f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m), \qquad (10.3)$$
where $\beta_m$, $m = 1, 2, \ldots, M$ are the expansion coefficients, and $b(x; \gamma) \in \mathbb{R}$ are usually simple functions of the multivariate argument $x$, characterized by a set of parameters $\gamma$. We discuss basis expansions in some detail in Chapter 5.

Additive expansions like this are at the heart of many of the learning techniques covered in this book:

• In single-hidden-layer neural networks (Chapter 11), $b(x; \gamma) = \sigma(\gamma_0 + \gamma_1^T x)$, where $\sigma(t) = 1/(1 + e^{-t})$ is the sigmoid function, and $\gamma$ parameterizes a linear combination of the input variables.
• In signal processing, wavelets (Section 5.9.1) are a popular choice with $\gamma$ parameterizing the location and scale shifts of a "mother" wavelet.
• Multivariate adaptive regression splines (Section 9.4) uses truncated power spline basis functions where $\gamma$ parameterizes the variables and values for the knots.
• For trees, $\gamma$ parameterizes the split variables and split points at the internal nodes, and the predictions at the terminal nodes.

Typically these models are fit by minimizing a loss function averaged over the training data, such as the squared-error or a likelihood-based loss function,
$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^N L\!\left(y_i, \sum_{m=1}^M \beta_m b(x_i; \gamma_m)\right). \qquad (10.4)$$
For many loss functions $L(y, f(x))$ and/or basis functions $b(x; \gamma)$, this requires computationally intensive numerical optimization techniques. However, a simple alternative often can be found when it is feasible to rapidly solve the subproblem of fitting just a single basis function,
$$\min_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, \beta b(x_i; \gamma)\big). \qquad (10.5)$$

10.3 Forward Stagewise Additive Modeling

Forward stagewise modeling approximates the solution to (10.4) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added.
This is outlined in Algorithm 10.2.

Algorithm 10.2 Forward Stagewise Additive Modeling.

1. Initialize $f_0(x) = 0$.
2. For $m = 1$ to $M$:
   (a) Compute
       $$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)).$$
   (b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.

At each iteration $m$, one solves for the optimal basis function $b(x; \gamma_m)$ and corresponding coefficient $\beta_m$ to add to the current expansion $f_{m-1}(x)$. This produces $f_m(x)$, and the process is repeated. Previously added terms are not modified.

For squared-error loss
$$L(y, f(x)) = (y - f(x))^2, \qquad (10.6)$$
one has
$$\begin{aligned} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)) &= (y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma))^2 \\ &= (r_{im} - \beta b(x_i; \gamma))^2, \end{aligned} \qquad (10.7)$$
where $r_{im} = y_i - f_{m-1}(x_i)$ is simply the residual of the current model on the $i$th observation. Thus, for squared-error loss, the term $\beta_m b(x; \gamma_m)$ that best fits the current residuals is added to the expansion at each step. This idea is the basis for "least squares" regression boosting discussed in Section 10.10.2. However, as we show near the end of the next section, squared-error loss is generally not a good choice for classification; hence the need to consider other loss criteria.
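To illustrate the squared-error case, here is a minimal Python sketch of Algorithm 10.2 with regression stumps as the basis functions $b(x; \gamma)$: each step fits a new stump to the current residuals and adds it to the expansion. The helper name forward_stagewise_ls and the use of scikit-learn's DecisionTreeRegressor are our choices for illustration; with squared-error loss the coefficient $\beta_m$ is absorbed into the stump's fitted terminal-node values, since a least-squares regression tree already solves (10.5) over both.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_ls(X, y, M=200):
    """Sketch of forward stagewise additive modeling (Algorithm 10.2),
    squared-error loss, regression stumps as basis functions."""
    f = np.zeros(len(y))          # 1. f_0(x) = 0 on the training data
    basis = []
    for m in range(M):
        r = y - f                 # residuals r_im = y_i - f_{m-1}(x_i)
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, r)           # 2(a) basis function best fitting the residuals
        basis.append(stump)
        f = f + stump.predict(X)  # 2(b) f_m = f_{m-1} + beta_m b(x; gamma_m)

    def predict(Xnew):
        return sum(b.predict(Xnew) for b in basis)

    return predict
```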
10.4 Exponential Loss and AdaBoost

We now show that AdaBoost.M1 (Algorithm 10.1) is equivalent to forward stagewise additive modeling (Algorithm 10.2) using the loss function
$$L(y, f(x)) = \exp(-y\, f(x)). \qquad (10.8)$$
The appropriateness of this criterion is addressed in the next section.

For AdaBoost the basis functions are the individual classifiers $G_m(x) \in \{-1, 1\}$.
Using the exponential loss function, one must solve
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N \exp[-y_i(f_{m-1}(x_i) + \beta\, G(x_i))]$$
for the classifier $G_m$ and corresponding coefficient $\beta_m$ to be added at each step. This can be expressed as
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N w_i^{(m)} \exp(-\beta\, y_i\, G(x_i)) \qquad (10.9)$$
with $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$. Since each $w_i^{(m)}$ depends neither on $\beta$ nor $G(x)$, it can be regarded as a weight that is applied to each observation. This weight depends on $f_{m-1}(x_i)$, and so the individual weight values change with each iteration $m$.

The solution to (10.9) can be obtained in two steps.
First, for any value of $\beta > 0$, the solution to (10.9) for $G_m(x)$ is
$$G_m = \arg\min_{G} \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)), \qquad (10.10)$$
which is the classifier that minimizes the weighted error rate in predicting $y$. This can be easily seen by expressing the criterion in (10.9) as
$$e^{-\beta} \cdot \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \cdot \sum_{y_i \neq G(x_i)} w_i^{(m)},$$
which in turn can be written as
$$\left(e^{\beta} - e^{-\beta}\right) \cdot \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)) + e^{-\beta} \cdot \sum_{i=1}^N w_i^{(m)}. \qquad (10.11)$$
Plugging this $G_m$ into (10.9) and solving for $\beta$ one obtains
$$\beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \qquad (10.12)$$
where $\mathrm{err}_m$ is the minimized weighted error rate
$$\mathrm{err}_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i^{(m)}}. \qquad (10.13)$$
The approximation is then updated
$$f_m(x) = f_{m-1}(x) + \beta_m G_m(x),$$
which causes the weights for the next iteration to be
$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{-\beta_m y_i G_m(x_i)}. \qquad (10.14)$$
Using the fact that $-y_i G_m(x_i) = 2 \cdot I(y_i \neq G_m(x_i)) - 1$, (10.14) becomes
$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m I(y_i \neq G_m(x_i))} \cdot e^{-\beta_m}, \qquad (10.15)$$
where $\alpha_m = 2\beta_m$ is the quantity defined at line 2c of AdaBoost.M1 (Algorithm 10.1).
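The step from (10.11) to (10.12), stated above as "solving for $\beta$," is a short calculus exercise; the following derivation is our own filling-in of that step, using the notation above. Differentiating (10.11) with respect to $\beta$ and setting the derivative to zero gives
$$\left(e^{\beta} + e^{-\beta}\right) \sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i)) - e^{-\beta} \sum_{i=1}^N w_i^{(m)} = 0.$$
Dividing through by $\sum_{i=1}^N w_i^{(m)}$ and using the definition (10.13) of $\mathrm{err}_m$,
$$\left(e^{\beta} + e^{-\beta}\right) \mathrm{err}_m = e^{-\beta} \;\Longrightarrow\; e^{2\beta} = \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} \;\Longrightarrow\; \beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m},$$
which is (10.12).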