The factor e^{-\beta_m} in (10.15) multiplies all weights by the same value, so it has no effect. Thus (10.15) is equivalent to line 2(d) of Algorithm 10.1.

One can view line 2(a) of the AdaBoost.M1 algorithm as a method for approximately solving the minimization in (10.11) and hence (10.10). Hence we conclude that AdaBoost.M1 minimizes the exponential loss criterion (10.8) via a forward-stagewise additive modeling approach.

Figure 10.3 shows the training-set misclassification error rate and average exponential loss for the simulated data problem (10.2) of Figure 10.2. The training-set misclassification error decreases to zero at around 250 iterations (and remains there), but the exponential loss keeps decreasing. Notice also in Figure 10.2 that the test-set misclassification error continues to improve after iteration 250.
Clearly AdaBoost is not optimizing training-set misclassification error; the exponential loss is more sensitive to changes in the estimated class probabilities.

FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss: (1/N) \sum_{i=1}^{N} \exp(-y_i f(x_i)). After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.
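To make the two training-set quantities plotted in Figure 10.3 concrete, here is a minimal Python sketch (not from the text; the helper names and the toy margins array are made up for illustration) that computes the misclassification error rate and the average exponential loss from the margins y_i f(x_i) of a fitted classifier.

    import numpy as np

    def misclassification_rate(margins):
        # an observation with negative margin y_i * f(x_i) is misclassified
        return np.mean(margins < 0)

    def average_exponential_loss(margins):
        # (1/N) * sum_i exp(-y_i * f(x_i)), the quantity plotted in Figure 10.3
        return np.mean(np.exp(-margins))

    margins = np.array([2.1, 0.3, -0.4, 1.7, 0.05])   # toy margins, for illustration only
    print(misclassification_rate(margins))            # 0.2
    print(average_exponential_loss(margins))          # mean of exp(-margin)

The misclassification rate changes only when a margin crosses zero, whereas the exponential loss keeps responding to every change in the margins, which is why the two curves in Figure 10.3 behave so differently.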
10.5 Why Exponential Loss?

The AdaBoost.M1 algorithm was originally motivated from a very different perspective than presented in the previous section. Its equivalence to forward stagewise additive modeling based on exponential loss was only discovered five years after its inception. By studying the properties of the exponential loss criterion, one can gain insight into the procedure and discover ways it might be improved.

The principal attraction of exponential loss in the context of additive modeling is computational; it leads to the simple modular reweighting AdaBoost algorithm. However, it is of interest to inquire about its statistical properties.
What does it estimate and how well is it being estimated? The first question is answered by seeking its population minimizer.

It is easy to show (Friedman et al., 2000) that

    f^*(x) = \arg\min_{f(x)} \mathrm{E}_{Y|x}\big(e^{-Yf(x)}\big) = \frac{1}{2} \log \frac{\Pr(Y=1|x)}{\Pr(Y=-1|x)},    (10.16)

or equivalently

    \Pr(Y=1|x) = \frac{1}{1 + e^{-2f^*(x)}}.

Thus, the additive expansion produced by AdaBoost is estimating one-half the log-odds of Pr(Y = 1|x). This justifies using its sign as the classification rule in (10.1).
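The population minimizer (10.16) can be checked numerically; the sketch below (illustrative only; the probability 0.7 and the grid of candidate values are arbitrary choices) minimizes the conditional expected exponential loss p e^{-f} + (1-p) e^{f} over a grid and compares the result with half the log-odds.

    import numpy as np

    p = 0.7                                    # Pr(Y = 1 | x), chosen arbitrarily
    f = np.linspace(-3, 3, 100001)             # grid of candidate values of f(x)
    expected_loss = p * np.exp(-f) + (1 - p) * np.exp(f)   # E_{Y|x} exp(-Y f(x))

    f_hat = f[np.argmin(expected_loss)]        # numerical minimizer over the grid
    f_star = 0.5 * np.log(p / (1 - p))         # half the log-odds, as in (10.16)
    print(f_hat, f_star)                       # both approximately 0.4236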
Another loss criterion with the same population minimizer is the binomial negative log-likelihood or deviance (also known as cross-entropy), interpreting f as the logit transform. Let

    p(x) = \Pr(Y=1|x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2f(x)}}    (10.17)

and define Y' = (Y + 1)/2 ∈ {0, 1}. Then the binomial log-likelihood loss function is

    l(Y, p(x)) = Y' \log p(x) + (1 - Y') \log(1 - p(x)),

or equivalently the deviance is

    -l(Y, f(x)) = \log\big(1 + e^{-2Yf(x)}\big).    (10.18)

Since the population maximizer of the log-likelihood is at the true probabilities p(x) = Pr(Y = 1|x), we see from (10.17) that the population minimizers of the deviance E_{Y|x}[-l(Y, f(x))] and E_{Y|x}[e^{-Yf(x)}] are the same. Thus, using either criterion leads to the same solution at the population level. Note that e^{-Yf} itself is not a proper log-likelihood, since it is not the logarithm of any probability mass function for a binary random variable Y ∈ {-1, 1}.
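The equivalence between the binomial log-likelihood and the deviance form (10.18) under the parametrization (10.17) can be verified directly; the following sketch (illustrative; the vectors y and f are arbitrary) evaluates both expressions.

    import numpy as np

    y = np.array([1, 1, -1, 1, -1])             # labels coded as -1/1
    f = np.array([0.8, -0.2, 1.5, 0.0, -2.3])   # arbitrary values of f(x)

    p = 1.0 / (1.0 + np.exp(-2.0 * f))          # p(x) from (10.17)
    y01 = (y + 1) / 2                           # Y' in {0, 1}

    neg_loglik = -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))
    deviance = np.log(1.0 + np.exp(-2.0 * y * f))   # eq. (10.18)
    print(np.allclose(neg_loglik, deviance))        # True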
10.6 Loss Functions and Robustness

In this section we examine the different loss functions for classification and regression more closely, and characterize them in terms of their robustness to extreme data.

Robust Loss Functions for Classification

Although both the exponential (10.8) and binomial deviance (10.18) yield the same solution when applied to the population joint distribution, the same is not true for finite data sets. Both criteria are monotone decreasing functions of the “margin” yf(x). In classification (with a -1/1 response) the margin plays a role analogous to the residuals y - f(x) in regression. The classification rule G(x) = sign[f(x)] implies that observations with positive margin y_i f(x_i) > 0 are classified correctly whereas those with negative margin y_i f(x_i) < 0 are misclassified. The decision boundary is defined by f(x) = 0. The goal of the classification algorithm is to produce positive margins as frequently as possible. Any loss criterion used for classification should penalize negative margins more heavily than positive ones since positive margin observations are already correctly classified.
FIGURE 10.4. Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification: I(sign(f) ≠ y); exponential: exp(-yf); binomial deviance: log(1 + exp(-2yf)); squared error: (y - f)^2; and support vector: (1 - yf)_+ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).
Figure 10.4 shows both the exponential (10.8) and binomial deviance criteria as a function of the margin y·f(x). Also shown is misclassification loss L(y, f(x)) = I(y·f(x) < 0), which gives unit penalty for negative margin values, and no penalty at all for positive ones. Both the exponential and deviance loss can be viewed as monotone continuous approximations to misclassification loss. They continuously penalize increasingly negative margin values more heavily than they reward increasingly positive ones. The difference between them is in degree. The penalty associated with binomial deviance increases linearly for large increasingly negative margin, whereas the exponential criterion increases the influence of such observations exponentially.
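For reference, the sketch below (a minimal numpy version, not the book's code; the function name and the grid of margins are made up) evaluates the five losses of Figure 10.4 as functions of the margin, with the binomial deviance rescaled by 1/log 2 so that, like the others, it equals 1 at zero margin.

    import numpy as np

    def figure_10_4_losses(margin):
        # losses of Figure 10.4 written as functions of the margin m = y * f
        m = np.asarray(margin, dtype=float)
        return {
            "misclassification": (m < 0).astype(float),
            "exponential": np.exp(-m),
            "binomial deviance": np.log(1 + np.exp(-2 * m)) / np.log(2),
            "squared error": (1 - m) ** 2,          # (y - f)^2 = (1 - y*f)^2 when y = +/-1
            "support vector": np.maximum(0.0, 1 - m),
        }

    for name, values in figure_10_4_losses(np.linspace(-2, 2, 5)).items():
        print(name, values)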
At any point in the training process the exponential criterion concentrates much more influence on observations with large negative margins. Binomial deviance concentrates relatively less influence on such observations, more evenly spreading the influence among all of the data. It is therefore far more robust in noisy settings where the Bayes error rate is not close to zero, and especially in situations where there is misspecification of the class labels in the training data. The performance of AdaBoost has been empirically observed to dramatically degrade in such situations.
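One way to see this difference in influence is to compare the magnitude of each criterion's derivative with respect to the margin, which is roughly how strongly an observation pulls on the fit; the sketch below (illustrative, not from the text) shows that the exponential criterion's derivative grows without bound as the margin becomes more negative, while the deviance's is bounded by 2.

    import numpy as np

    margins = np.array([-4.0, -2.0, 0.0, 2.0])

    # |d/dm exp(-m)|: unbounded as the margin m becomes very negative
    grad_exponential = np.exp(-margins)

    # |d/dm log(1 + exp(-2m))| = 2 / (1 + exp(2m)): never exceeds 2
    grad_deviance = 2.0 / (1.0 + np.exp(2.0 * margins))

    for m, ge, gd in zip(margins, grad_exponential, grad_deviance):
        print(f"margin {m:+.1f}: exponential {ge:8.2f}, deviance {gd:.2f}")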
Also shown in the figure is squared-error loss. The minimizer of the corresponding risk on the population is

    f^*(x) = \arg\min_{f(x)} \mathrm{E}_{Y|x}(Y - f(x))^2 = \mathrm{E}(Y|x) = 2 \cdot \Pr(Y=1|x) - 1.    (10.19)
As before the classification rule is G(x) = sign[f(x)]. Squared-error loss is not a good surrogate for misclassification error. As seen in Figure 10.4, it is not a monotone decreasing function of increasing margin yf(x). For margin values y_i f(x_i) > 1 it increases quadratically, thereby placing increasing influence (error) on observations that are correctly classified with increasing certainty, thereby reducing the relative influence of those incorrectly classified y_i f(x_i) < 0. Thus, if class assignment is the goal, a monotone decreasing criterion serves as a better surrogate loss function.
Figure 12.4 on page 426 in Chapter 12 includes a modification of quadratic loss, the “Huberized” square hinge loss (Rosset et al., 2004b), which enjoys the favorable properties of the binomial deviance, quadratic loss and the SVM hinge loss. It has the same population minimizer as the quadratic (10.19), is zero for y·f(x) > 1, and becomes linear for y·f(x) < -1. Since quadratic functions are easier to compute with than exponentials, our experience suggests this to be a useful alternative to the binomial deviance.
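As a rough illustration only (this is an assumed parameterization, not necessarily the exact form used in Chapter 12 or by Rosset et al.), a loss with the stated behavior can be written as zero for margins above 1, quadratic on [-1, 1], and linear below -1, with value and slope matched at -1:

    import numpy as np

    def huberized_square_hinge(margin):
        # assumed form: 0 for m > 1, (1 - m)^2 for -1 <= m <= 1, -4*m for m < -1;
        # the pieces agree in value (4) and slope (-4) at m = -1
        m = np.asarray(margin, dtype=float)
        return np.where(m >= 1, 0.0,
               np.where(m >= -1, (1 - m) ** 2, -4.0 * m))

    print(huberized_square_hinge([-3, -1, 0, 0.5, 2]))   # 12, 4, 1, 0.25, 0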
With K-class classification, the response Y takes values in the unordered set G = {G_1, . . . , G_K} (see Sections 2.4 and 4.4). We now seek a classifier G(x) taking values in G. It is sufficient to know the class conditional probabilities p_k(x) = Pr(Y = G_k|x), k = 1, 2, . . . , K, for then the Bayes classifier is

    G(x) = G_k  where  k = \arg\max_\ell \, p_\ell(x).    (10.20)

In principle, though, we need not learn the p_k(x), but simply which one is largest.
However, in data mining applications the interest is often more in the class probabilities p_\ell(x), \ell = 1, . . . , K themselves, rather than in performing a class assignment. As in Section 4.4, the logistic model generalizes naturally to K classes,

    p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^{K} e^{f_l(x)}},    (10.21)

which ensures that 0 ≤ p_k(x) ≤ 1 and that they sum to one.
Note thathere we have K different functions, one per class. There is a redundancyin the functions fk (x), since adding an arbitrary h(x) to each leaves themodel unchanged. Traditionally one of them is set to zero: for example,10.6 Loss Functions and Robustness349fK (x) = 0, as in (4.17). Here we prefer to retain the symmetry, and imposePKthe constraint k=1 fk (x) = 0. The binomial deviance extends naturallyto the K-class multinomial deviance loss function:L(y, p(x))==−−KXk=1KXk=1I(y = Gk ) log pk (x)I(y = Gk )fk (x) + logKXℓ=1efℓ (x)!.
Robust Loss Functions for Regression

In the regression setting, analogous to the relationship between exponential loss and binomial log-likelihood is the relationship between squared-error loss L(y, f(x)) = (y - f(x))^2 and absolute loss L(y, f(x)) = |y - f(x)|. The population solutions are f(x) = E(Y|x) for squared-error loss, and f(x) = median(Y|x) for absolute loss; for symmetric error distributions these are the same.
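As a small illustration of these two population solutions (not from the text; the sample, with one extreme value, is arbitrary), minimizing the average squared error over a grid of constants recovers the sample mean, while minimizing the average absolute error recovers the sample median, and the extreme observation pulls the mean far more than the median.

    import numpy as np

    y = np.array([1.2, 0.8, 1.1, 0.9, 15.0])   # arbitrary sample with one extreme value

    candidates = np.linspace(0, 16, 16001)
    sq_risk = [np.mean((y - c) ** 2) for c in candidates]
    abs_risk = [np.mean(np.abs(y - c)) for c in candidates]

    print(candidates[np.argmin(sq_risk)], np.mean(y))      # both approximately 3.8
    print(candidates[np.argmin(abs_risk)], np.median(y))   # both approximately 1.1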