Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.

• 6 quantitative predictors—the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.

• The average length of uninterrupted sequences of capital letters: CAPAVE.

• The length of the longest uninterrupted sequence of capital letters: CAPMAX.

• The sum of the length of uninterrupted sequences of capital letters: CAPTOT.

TABLE 9.1. Test data confusion matrix for the additive logistic regression model fit to the spam training data. The overall test error rate is 5.5%.

                        Predicted Class
True Class         email (0)    spam (1)
email (0)            58.3%        2.5%
spam (1)              3.0%       36.3%

We coded spam as 1 and email as zero. A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set. A generalized additive model was fit, using a cubic smoothing spline with a nominal four degrees of freedom for each predictor.
What this means is that for each predictor Xj, the smoothing-spline parameter λj was chosen so that trace[Sj(λj)] − 1 = 4, where Sj(λ) is the smoothing spline operator matrix constructed using the observed values xij, i = 1, . . . , N. This is a convenient way of specifying the amount of smoothing in such a complex model.

Most of the spam predictors have a very long-tailed distribution. Before fitting the GAM model, we log-transformed each variable (actually log(x + 0.1)), but the plots in Figure 9.1 are shown as a function of the original variables.
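As an aside (not from the book), the degrees-of-freedom calibration just described can be sketched in a few lines: for a single predictor, bisect on the penalty λ until the trace of the smoother matrix S(λ) hits the target value. The basis and penalty below (a cubic regression spline with a handful of knots and a ridge penalty on the knot terms) are illustrative stand-ins for the smoothing spline used in the text.

```python
# Minimal sketch of choosing lambda so that trace[S(lambda)] - 1 = 4; the
# penalized regression-spline smoother here is an assumed stand-in for the
# smoothing spline of the text.
import numpy as np

def truncated_power_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, and (x - k)_+^3 per knot."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def effective_df(x, lam, knots):
    """trace of S(lam) = B (B'B + lam * P)^(-1) B', with a ridge penalty P on
    the knot coefficients only (the cubic polynomial part is unpenalized)."""
    B = truncated_power_basis(x, knots)
    P = np.zeros((B.shape[1], B.shape[1]))
    P[4:, 4:] = np.eye(len(knots))
    S = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    return np.trace(S)

def lambda_for_trace(x, target_trace, knots, lo=1e-8, hi=1e8):
    """Bisect on lambda; the effective df is monotone decreasing in lambda."""
    for _ in range(100):
        mid = np.sqrt(lo * hi)
        if effective_df(x, mid, knots) > target_trace:
            lo = mid          # fit too flexible: increase the penalty
        else:
            hi = mid          # fit too smooth: decrease the penalty
    return mid

rng = np.random.default_rng(0)
x = np.log(rng.exponential(size=300) + 0.1)               # long-tailed predictor, log(x + 0.1)
knots = np.quantile(x, np.linspace(0.1, 0.9, 10))
lam = lambda_for_trace(x, target_trace=5.0, knots=knots)  # trace = 5, i.e. trace - 1 = 4
print(lam, effective_df(x, lam, knots))
```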
The test error rates are shown in Table 9.1; the overall error rate is 5.3%. By comparison, a linear logistic regression has a test error rate of 7.6%. Table 9.2 shows the predictors that are highly significant in the additive model.

For ease of interpretation, in Table 9.2 the contribution for each variable is decomposed into a linear component and the remaining nonlinear component. The top block of predictors are positively correlated with spam, while the bottom block is negatively correlated. The linear component is a weighted least squares linear fit of the fitted curve on the predictor, while the nonlinear part is the residual. The linear component of an estimated function is summarized by the coefficient, standard error and Z-score; the latter is the coefficient divided by its standard error, and is considered significant if it exceeds the appropriate quantile of a standard normal distribution. The column labeled nonlinear P-value is a test of nonlinearity of the estimated function. Note, however, that the effect of each predictor is fully adjusted for the entire effects of the other predictors, not just for their linear parts. The predictors shown in the table were judged significant by at least one of the tests (linear or nonlinear) at the p = 0.01 level (two-sided).

TABLE 9.2. Significant predictors from the additive model fit to the spam training data. The coefficients represent the linear part of fˆj, along with their standard errors and Z-score. The nonlinear P-value is for a test of nonlinearity of fˆj.

Positive effects
Name        Num.   df   Coefficient   Std. Error   Z Score   Nonlinear P-value
our            5   3.9        0.566        0.114     4.970               0.052
over           6   3.9        0.244        0.195     1.249               0.004
remove         7   4.0        0.949        0.183     5.201               0.093
internet       8   4.0        0.524        0.176     2.974               0.028
free          16   3.9        0.507        0.127     4.010               0.065
business      17   3.8        0.779        0.186     4.179               0.194
hpl           26   3.8        0.045        0.250     0.181               0.002
ch!           52   4.0        0.674        0.128     5.283               0.164
ch$           53   3.9        1.419        0.280     5.062               0.354
CAPMAX        56   3.8        0.247        0.228     1.080               0.000
CAPTOT        57   4.0        0.755        0.165     4.566               0.063

Negative effects
hp            25   3.9       −1.404        0.224    −6.262               0.140
george        27   3.7       −5.003        0.744    −6.722               0.045
1999          37   3.8       −0.672        0.191    −3.512               0.011
re            45   3.9       −0.620        0.133    −4.649               0.597
edu           46   4.0       −1.183        0.209    −5.647               0.000
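The linear/nonlinear decomposition reported in Table 9.2 is itself easy to compute once a curve has been fit: the linear component is the weighted least squares regression of the fitted curve on the predictor, and the nonlinear part is the residual. A small sketch with placeholder data and weights (not the book's fitted functions):

```python
# Sketch of the Table 9.2 decomposition; f_hat and the weights w are synthetic
# placeholders for a fitted curve f_j and the weights from the GAM fit.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.exponential(size=200))            # a long-tailed predictor
f_hat = np.log1p(x) - np.log1p(x).mean()          # stand-in for a fitted f_j
w = np.ones_like(x)                               # e.g., final IRLS weights

# Weighted least squares of f_hat on (1, x): solve (B'WB) beta = B'W f_hat.
B = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(B.T @ (w[:, None] * B), B.T @ (w * f_hat))
linear_part = B @ beta                            # the "linear component"
nonlinear_part = f_hat - linear_part              # the residual, i.e. the nonlinear part
print("linear coefficient:", beta[1])
```

The Z-score in the table additionally requires the standard error of this coefficient, which comes from the fitted additive model and is not reproduced here.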
Figure 9.1 shows the estimated functions for the significant predictors appearing in Table 9.2. Many of the nonlinear effects appear to account for a strong discontinuity at zero. For example, the probability of spam drops significantly as the frequency of george increases from zero, but then does not change much after that. This suggests that one might replace each of the frequency predictors by an indicator variable for a zero count, and resort to a linear logistic model. This gave a test error rate of 7.4%; including the linear effects of the frequencies as well dropped the test error to 6.6%. It appears that the nonlinearities in the additive model have additional predictive power.

FIGURE 9.1. Spam analysis: estimated functions for significant predictors. The rug plot along the bottom of each frame indicates the observed values of the corresponding predictor. For many of the predictors the nonlinearity picks up the discontinuity at zero.

It is more serious to classify a genuine email message as spam, since then a good email would be filtered out and would not reach the user. We can alter the balance between the class error rates by changing the losses (see Section 2.4). If we assign a loss L01 for predicting a true class 0 as class 1, and L10 for predicting a true class 1 as class 0, then the estimated Bayes rule predicts class 1 if its probability is greater than L01/(L01 + L10).
For example, if we take L01 = 10, L10 = 1 then the (true) class 0 and class 1 error rates change to 0.8% and 8.7%.

More ambitiously, we can encourage the model to fit the data better in class 0 by using weights L01 for the class 0 observations and L10 for the class 1 observations. As above, we then use the estimated Bayes rule to predict. This gave error rates of 1.2% and 8.0% in (true) class 0 and class 1, respectively.
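A minimal sketch of both devices, with an ordinary linear logistic regression on synthetic data standing in for the fitted additive model (so the error rates will not match the numbers quoted above):

```python
# (i) threshold the fitted probabilities at L01/(L01 + L10);
# (ii) refit with observation weights L01 (class 0) and L10 (class 1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (class 1 = "spam", class 0 = "email").
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.6, 0.4],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def per_class_error(y_true, y_pred):
    """Error rate within true class 0 and within true class 1."""
    return tuple(float(np.mean(y_pred[y_true == k] != k)) for k in (0, 1))

L01, L10 = 10.0, 1.0                  # misclassifying email as spam is 10x worse
threshold = L01 / (L01 + L10)         # estimated Bayes rule: predict 1 if p > 10/11

# (i) Change the threshold on the fitted probabilities.
fit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_hat = (fit.predict_proba(X_test)[:, 1] > threshold).astype(int)
print("thresholded:", per_class_error(y_test, y_hat))

# (ii) Weighted fit, then the same thresholded rule.
w = np.where(y_train == 0, L01, L10)
fit_w = LogisticRegression(max_iter=1000).fit(X_train, y_train, sample_weight=w)
y_hat_w = (fit_w.predict_proba(X_test)[:, 1] > threshold).astype(int)
print("weighted fit:", per_class_error(y_test, y_hat_w))
```

In scikit-learn the weighted fit can equivalently be requested through the class_weight argument of LogisticRegression.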
We discuss the issue of unequal losses further below, in the context of tree-based models.

After fitting an additive model, one should check whether the inclusion of some interactions can significantly improve the fit. This can be done “manually,” by inserting products of some or all of the significant inputs, or automatically via the MARS procedure (Section 9.4).
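A rough sketch of the manual check, again with synthetic data and a plain logistic regression standing in for the additive fit; the indices of the "significant" inputs are an arbitrary assumption:

```python
# Add pairwise products of a few selected inputs and compare held-out error.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

significant = [0, 2, 5]                     # assumed indices of significant inputs

def with_products(X, idx):
    """Augment the design matrix with pairwise products of the chosen inputs."""
    prods = [X[:, i] * X[:, j] for i, j in combinations(idx, 2)]
    return np.column_stack([X] + prods)

base = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
inter = LogisticRegression(max_iter=2000).fit(with_products(X_tr, significant), y_tr)
print("test error, main effects only:", 1 - base.score(X_te, y_te))
print("test error, with interactions:", 1 - inter.score(with_products(X_te, significant), y_te))
```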
This example uses the additive model in an automatic fashion. As a data analysis tool, additive models are often used in a more interactive fashion, adding and dropping terms to determine their effect. By calibrating the amount of smoothing in terms of dfj, one can move seamlessly between linear models (dfj = 1) and partially linear models, where some terms are modeled more flexibly. See Hastie and Tibshirani (1990) for more details.

9.1.3 Summary

Additive models provide a useful extension of linear models, making them more flexible while still retaining much of their interpretability.
The familiar tools for modeling and inference in linear models are also available for additive models, seen for example in Table 9.2. The backfitting procedure for fitting these models is simple and modular, allowing one to choose a fitting method appropriate for each input variable. As a result they have become widely used in the statistical community.
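For concreteness, here is a minimal sketch of the backfitting cycle for an additive regression fit under squared-error loss, with a simple nearest-neighbor running-mean smoother standing in for the smoothing splines of the text; the data and the smoother are illustrative assumptions.

```python
# Backfitting sketch: cycle over coordinates, smoothing partial residuals.
import numpy as np

def knn_smooth(x, r, k=30):
    """Smooth residuals r against x by averaging ~k nearest neighbors in x."""
    order = np.argsort(x)
    r_sorted = r[order]
    fitted_sorted = np.array([r_sorted[max(0, i - k // 2): i + k // 2 + 1].mean()
                              for i in range(len(x))])
    fitted = np.empty_like(fitted_sorted)
    fitted[order] = fitted_sorted
    return fitted

def backfit(X, y, n_iter=20):
    """Return the intercept and an n x p matrix F whose column j holds the
    fitted function f_j evaluated at the observed x_ij."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - F.sum(axis=1) + F[:, j]   # leave f_j out
            F[:, j] = knn_smooth(X[:, j], partial)
            F[:, j] -= F[:, j].mean()                       # center each f_j
    return alpha, F

# Toy usage: an additive surface with two nonlinear components.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=500)
alpha, F = backfit(X, y)
```

The modularity mentioned above is visible here: any smoother with the same interface could replace knn_smooth for a particular coordinate.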
However, additive models can have limitations for large data-mining applications. The backfitting algorithm fits all predictors, which is not feasible or desirable when a large number are available. The BRUTO procedure (Hastie and Tibshirani, 1990, Chapter 9) combines backfitting with selection of inputs, but is not designed for large data-mining problems. There has also been recent work using lasso-type penalties to estimate sparse additive models, for example the COSSO procedure of Lin and Zhang (2006) and the SpAM proposal of Ravikumar et al. (2008). For large problems a forward stagewise approach such as boosting (Chapter 10) is more effective, and also allows for interactions to be included in the model.

9.2 Tree-Based Methods

9.2.1 Background

Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (like a constant) in each one. They are conceptually simple yet powerful.
We first describe a popular method for tree-based regression and classification called CART, and later contrast it with C4.5, a major competitor.

Let's consider a regression problem with continuous response Y and inputs X1 and X2, each taking values in the unit interval. The top left panel of Figure 9.2 shows a partition of the feature space by lines that are parallel to the coordinate axes. In each partition element we can model Y with a different constant. However, there is a problem: although each partitioning line has a simple description like X1 = c, some of the resulting regions are complicated to describe.

To simplify matters, we restrict attention to recursive binary partitions like that in the top right panel of Figure 9.2. We first split the space into two regions, and model the response by the mean of Y in each region. We choose the variable and split-point to achieve the best fit.
Then one or both of these regions are split into two more regions, and this process is continued, until some stopping rule is applied. For example, in the top right panel of Figure 9.2, we first split at X1 = t1. Then the region X1 ≤ t1 is split at X2 = t2 and the region X1 > t1 is split at X1 = t3. Finally, the region X1 > t3 is split at X2 = t4. The result of this process is a partition into the five regions R1, R2, . . . , R5 shown in the figure. The corresponding regression model predicts Y with a constant cm in region Rm, that is,

fˆ(X) = Σ_{m=1}^{5} cm I{(X1, X2) ∈ Rm}.    (9.9)

This same model can be represented by the binary tree in the bottom left panel of Figure 9.2.
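A small numpy sketch of both ideas: evaluating the piecewise-constant model (9.9) for the partition of Figure 9.2 under assumed split points, and choosing a single splitting variable and split-point by minimizing the within-region sum of squares (the names and data are illustrative, not CART's implementation):

```python
# (i) the piecewise-constant model (9.9) over R1,...,R5 of Figure 9.2;
# (ii) greedy choice of a single variable and split-point by squared error.
import numpy as np

def predict_tree(x1, x2, t, c):
    """Evaluate (9.9); t = (t1, t2, t3, t4) are assumed split points and
    c = (c1, ..., c5) the region constants (e.g., region means)."""
    t1, t2, t3, t4 = t
    if x1 <= t1:
        return c[0] if x2 <= t2 else c[1]          # R1, R2
    if x1 <= t3:
        return c[2]                                # R3
    return c[3] if x2 <= t4 else c[4]              # R4, R5

def best_split(X, y):
    """Pick the variable j and split-point s minimizing the total squared
    error around the two region means."""
    best = (np.inf, None, None)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, s)
    return best   # (sse, variable index, split point)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                     # X1, X2 in the unit interval
y = np.where(X[:, 0] <= 0.4, 1.0, 3.0) + rng.normal(scale=0.3, size=200)
print(best_split(X, y))                            # should recover variable 0, split near 0.4
```

Growing a full tree amounts to applying best_split recursively within each resulting region until a stopping rule is met.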