We describe five related techniques: generalized additive models, trees, multivariate adaptive regression splines, the patient rule induction method, and hierarchical mixtures of experts.

9.1 Generalized Additive Models

Regression models play an important role in many data analyses, providing prediction and classification rules, and data analytic tools for understanding the importance of different inputs. Although attractively simple, the traditional linear model often fails in these situations: in real life, effects are often not linear.
In earlier chapters we described techniques that used predefined basis functions to achieve nonlinearities. This section describes more automatic flexible statistical methods that may be used to identify and characterize nonlinear regression effects. These methods are called “generalized additive models.”

In the regression setting, a generalized additive model has the form

    E(Y | X1, X2, . . . , Xp) = α + f1(X1) + f2(X2) + · · · + fp(Xp).    (9.1)

As usual X1, X2, . . . , Xp represent predictors and Y is the outcome; the fj’s are unspecified smooth (“nonparametric”) functions. If we were to model each function using an expansion of basis functions (as in Chapter 5), the resulting model could then be fit by simple least squares. Our approach here is different: we fit each function using a scatterplot smoother (e.g., a cubic smoothing spline or kernel smoother), and provide an algorithm for simultaneously estimating all p functions (Section 9.1.1).

For two-class classification, recall the logistic regression model for binary data discussed in Section 4.4. We relate the mean of the binary response µ(X) = Pr(Y = 1|X) to the predictors via a linear regression model and the logit link function:

    log[µ(X)/(1 − µ(X))] = α + β1 X1 + · · · + βp Xp.    (9.2)

The additive logistic regression model replaces each linear term by a more general functional form

    log[µ(X)/(1 − µ(X))] = α + f1(X1) + · · · + fp(Xp),    (9.3)

where again each fj is an unspecified smooth function.
While the nonparametric form for the functions fj makes the model more flexible, the additivity is retained and allows us to interpret the model in much the same way as before. The additive logistic regression model is an example of a generalized additive model. In general, the conditional mean µ(X) of a response Y is related to an additive function of the predictors via a link function g:

    g[µ(X)] = α + f1(X1) + · · · + fp(Xp).    (9.4)

Examples of classical link functions are the following:

• g(µ) = µ is the identity link, used for linear and additive models for Gaussian response data.

• g(µ) = logit(µ) as above, or g(µ) = probit(µ), the probit link function, for modeling binomial probabilities. The probit function is the inverse Gaussian cumulative distribution function: probit(µ) = Φ⁻¹(µ).

• g(µ) = log(µ) for log-linear or log-additive models for Poisson count data.

All three of these arise from exponential family sampling models, which in addition include the gamma and negative-binomial distributions. These families generate the well-known class of generalized linear models, which are all extended in the same way to generalized additive models.
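As a concrete illustration (ours, not from the text), the sketch below writes each of these classical links and its inverse as a plain function using NumPy and SciPy; the names identity_link, logit_link, probit_link and log_link are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Each pair is (link g, inverse link g^{-1}); names are illustrative.
identity_link = (lambda mu: mu, lambda eta: eta)                  # Gaussian responses
logit_link    = (lambda mu: np.log(mu / (1 - mu)),
                 lambda eta: 1 / (1 + np.exp(-eta)))              # binomial probabilities
probit_link   = (norm.ppf, norm.cdf)                              # probit(mu) = Phi^{-1}(mu)
log_link      = (np.log, np.exp)                                  # Poisson counts

mu = np.array([0.1, 0.5, 0.9])
print(logit_link[0](mu))    # probabilities mapped to the whole real line
print(probit_link[0](mu))   # same idea on the probit scale
```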
The functions fj are estimated in a flexible manner, using an algorithm whose basic building block is a scatterplot smoother. The estimated function fˆj can then reveal possible nonlinearities in the effect of Xj. Not all of the functions fj need to be nonlinear. We can easily mix in linear and other parametric forms with the nonlinear terms, a necessity when some of the inputs are qualitative variables (factors). The nonlinear terms are not restricted to main effects either; we can have nonlinear components in two or more variables, or separate curves in Xj for each level of the factor Xk. Thus each of the following would qualify:

• g(µ) = Xᵀβ + αk + f(Z): a semiparametric model, where X is a vector of predictors to be modeled linearly, αk the effect for the kth level of a qualitative input V, and the effect of predictor Z is modeled nonparametrically.

• g(µ) = f(X) + gk(Z): again k indexes the levels of a qualitative input V, and thus creates an interaction term g(V, Z) = gk(Z) for the effect of V and Z.

• g(µ) = f(X) + g(Z, W), where g is a nonparametric function in two features.

Additive models can replace linear models in a wide variety of settings, for example an additive decomposition of time series,

    Yt = St + Tt + εt,    (9.5)

where St is a seasonal component, Tt is a trend, and εt is an error term.
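As a small illustration of the decomposition (9.5) (our own simulated example, not from the text), the sketch below recovers a trend Tt by a centered moving average and a seasonal component St by per-period averages; the residual plays the role of εt.

```python
import numpy as np

rng = np.random.default_rng(0)
period = 12
t = np.arange(10 * period)

# Simulate Y_t = S_t + T_t + eps_t with a known trend and seasonal pattern.
trend = 0.05 * t
season = 2.0 * np.sin(2 * np.pi * t / period)
y = trend + season + rng.normal(scale=0.5, size=t.size)

# Estimate the trend with a moving average over one full period
# (edge effects at the ends of the series are ignored in this sketch).
trend_hat = np.convolve(y, np.ones(period) / period, mode="same")

# Estimate the seasonal component as the average detrended value at each
# position in the cycle, centered so the seasonal effects sum to zero.
detrended = y - trend_hat
season_hat = np.array([detrended[t % period == m].mean() for m in range(period)])
season_hat -= season_hat.mean()

resid = y - trend_hat - season_hat[t % period]   # plays the role of eps_t
print("residual std:", round(float(resid.std()), 2))
```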
9.1.1 Fitting Additive Models

In this section we describe a modular algorithm for fitting additive models and their generalizations. The building block is the scatterplot smoother for fitting nonlinear effects in a flexible way. For concreteness we use as our scatterplot smoother the cubic smoothing spline described in Chapter 5.

The additive model has the form

    Y = α + Σ_{j=1}^p fj(Xj) + ε,    (9.6)

where the error term ε has mean zero. Given observations xi, yi, a criterion like the penalized sum of squares (5.9) of Section 5.4 can be specified for this problem,

    PRSS(α, f1, f2, . . . , fp) = Σ_{i=1}^N ( yi − α − Σ_{j=1}^p fj(xij) )² + Σ_{j=1}^p λj ∫ fj′′(tj)² dtj,    (9.7)

where the λj ≥ 0 are tuning parameters. It can be shown that the minimizer of (9.7) is an additive cubic spline model; each of the functions fj is a cubic spline in the component Xj, with knots at each of the unique values of xij, i = 1, . . . , N. However, without further restrictions on the model, the solution is not unique. The constant α is not identifiable, since we can add or subtract any constants to each of the functions fj, and adjust α accordingly. The standard convention is to assume that Σ_{i=1}^N fj(xij) = 0 ∀j: the functions average zero over the data. It is easily seen that α̂ = ave(yi) in this case. If in addition to this restriction the matrix of input values (having ijth entry xij) has full column rank, then (9.7) is a strictly convex criterion and the minimizer is unique. If the matrix is singular, then the linear part of the components fj cannot be uniquely determined (while the nonlinear parts can!) (Buja et al., 1989).

Furthermore, a simple iterative procedure exists for finding the solution. We set α̂ = ave(yi), and it never changes. We apply a cubic smoothing spline Sj to the targets {yi − α̂ − Σ_{k≠j} fˆk(xik)}_1^N, as a function of xij, to obtain a new estimate fˆj. This is done for each predictor in turn, using the current estimates of the other functions fˆk when computing yi − α̂ − Σ_{k≠j} fˆk(xik). The process is continued until the estimates fˆj stabilize. This procedure, given in detail in Algorithm 9.1, is known as “backfitting” and the resulting fit is analogous to a multiple regression for linear models. In principle, the second step in (2) of Algorithm 9.1 is not needed, since the smoothing spline fit to a mean-zero response has mean zero (Exercise 9.1). In practice, machine rounding can cause slippage, and the adjustment is advised.

Algorithm 9.1 The Backfitting Algorithm for Additive Models.

1. Initialize: α̂ = (1/N) Σ_{i=1}^N yi, fˆj ≡ 0, ∀j.

2. Cycle: j = 1, 2, . . . , p, . . . , 1, 2, . . . , p, . . . ,

       fˆj ← Sj[ {yi − α̂ − Σ_{k≠j} fˆk(xik)}_1^N ],

       fˆj ← fˆj − (1/N) Σ_{i=1}^N fˆj(xij),

   until the functions fˆj change less than a prespecified threshold.
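The following is a minimal sketch of Algorithm 9.1 (the names, simulated data, and helper smooth are ours). SciPy's UnivariateSpline stands in for the cubic smoothing spline Sj; its smoothing parameter plays the role of λj and is simply left at the library default here.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth(x, r):
    """Cubic smoothing spline of the partial residuals r on predictor x,
    evaluated back at x. Stands in for the operator S_j; the smoothing
    level (SciPy's s, playing the role of lambda_j) is left at its default."""
    order = np.argsort(x)                     # spline fitting needs sorted x
    spline = UnivariateSpline(x[order], r[order], k=3)
    return spline(x)

def backfit_additive(X, y, n_cycles=20, tol=1e-6):
    """Backfitting (Algorithm 9.1) for Y = alpha + sum_j f_j(X_j) + eps."""
    N, p = X.shape
    alpha = y.mean()                          # alpha_hat = ave(y_i); never changes
    f = np.zeros((N, p))                      # f_j evaluated at the training data
    for _ in range(n_cycles):
        f_old = f.copy()
        for j in range(p):
            partial = y - alpha - f.sum(axis=1) + f[:, j]   # remove all terms but f_j
            f[:, j] = smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()         # re-center so each f_j averages zero
        if np.abs(f - f_old).max() < tol:     # stop once the functions stabilize
            break
    return alpha, f

# Illustrative use on simulated data (distinct predictor values assumed).
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=300)
alpha_hat, f_hat = backfit_additive(X, y)
print(alpha_hat)
```

In practice the smoothing levels would be tuned (for example by the effective degrees of freedom discussed below) rather than left at their defaults.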
This same algorithm can accommodate other fitting methods in exactly the same way, by specifying appropriate smoothing operators Sj:

• other univariate regression smoothers such as local polynomial regression and kernel methods;

• linear regression operators yielding polynomial fits, piecewise constant fits, parametric spline fits, series and Fourier fits;

• more complicated operators such as surface smoothers for second or higher-order interactions or periodic smoothers for seasonal effects.

If we consider the operation of smoother Sj only at the training points, it can be represented by an N × N operator matrix Sj (see Section 5.4.1). Then the degrees of freedom for the jth term are (approximately) computed as dfj = trace[Sj] − 1, by analogy with degrees of freedom for smoothers discussed in Chapters 5 and 6.
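A small sketch of this df computation (ours, with an illustrative bandwidth), using a Gaussian kernel smoother so that the operator matrix Sj can be written down explicitly; the same trace formula applies to any linear smoother.

```python
import numpy as np

def kernel_smoother_matrix(x, bandwidth=0.3):
    """Operator matrix S_j of a Gaussian-kernel (Nadaraya-Watson) smoother.
    The smoother is linear, so its fit at the training points is S @ y."""
    d = (x[:, None] - x[None, :]) / bandwidth
    K = np.exp(-0.5 * d ** 2)
    return K / K.sum(axis=1, keepdims=True)

x = np.sort(np.random.default_rng(2).uniform(-2, 2, size=100))
S = kernel_smoother_matrix(x)
df_j = np.trace(S) - 1     # approximate degrees of freedom for the jth term
print(round(df_j, 2))
```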
For a large class of linear smoothers Sj, backfitting is equivalent to a Gauss–Seidel algorithm for solving a certain linear system of equations. Details are given in Exercise 9.2.

For the logistic regression model and other generalized additive models, the appropriate criterion is a penalized log-likelihood. To maximize it, the backfitting procedure is used in conjunction with a likelihood maximizer. The usual Newton–Raphson routine for maximizing log-likelihoods in generalized linear models can be recast as an IRLS (iteratively reweighted least squares) algorithm. This involves repeatedly fitting a weighted linear regression of a working response variable on the covariates; each regression yields a new value of the parameter estimates, which in turn give new working responses and weights, and the process is iterated (see Section 4.4.1). In the generalized additive model, the weighted linear regression is simply replaced by a weighted backfitting algorithm. We describe the algorithm in more detail for logistic regression below, and more generally in Chapter 6 of Hastie and Tibshirani (1990).
9.1.2 Example: Additive Logistic Regression

Probably the most widely used model in medical research is the logistic model for binary data. In this model the outcome Y can be coded as 0 or 1, with 1 indicating an event (like death or relapse of a disease) and 0 indicating no event. We wish to model Pr(Y = 1|X), the probability of an event given values of the prognostic factors Xᵀ = (X1, . . . , Xp). The goal is usually to understand the roles of the prognostic factors, rather than to classify new individuals. Logistic models are also used in applications where one is interested in estimating the class probabilities, for use in risk screening. Apart from medical applications, credit risk screening is a popular application.

The generalized additive logistic model has the form

    log[Pr(Y = 1|X)/Pr(Y = 0|X)] = α + f1(X1) + · · · + fp(Xp).    (9.8)

The functions f1, f2, . . . , fp are estimated by a backfitting algorithm within a Newton–Raphson procedure, shown in Algorithm 9.2.

Algorithm 9.2 Local Scoring Algorithm for the Additive Logistic Regression Model.

1. Compute starting values: α̂ = log[ȳ/(1 − ȳ)], where ȳ = ave(yi), the sample proportion of ones, and set fˆj ≡ 0 ∀j.

2. Define η̂i = α̂ + Σ_j fˆj(xij) and p̂i = 1/[1 + exp(−η̂i)]. Iterate:

   (a) Construct the working target variable

           zi = η̂i + (yi − p̂i)/[p̂i(1 − p̂i)].

   (b) Construct weights wi = p̂i(1 − p̂i).

   (c) Fit an additive model to the targets zi with weights wi, using a weighted backfitting algorithm. This gives new estimates α̂, fˆj, ∀j.

3. Continue step 2 until the change in the functions falls below a prespecified threshold.

The additive model fitting in step (2) of Algorithm 9.2 requires a weighted scatterplot smoother. Most smoothing procedures can accept observation weights (Exercise 5.12); see Chapter 3 of Hastie and Tibshirani (1990) for further details.
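Below is a minimal sketch of Algorithm 9.2. All names are ours, a weighted Gaussian kernel smoother stands in for the weighted scatterplot smoother, and fixed iteration counts replace the convergence tests of steps 2 and 3.

```python
import numpy as np

def wkernel_smooth(x, z, w, bandwidth=0.5):
    """Weighted Gaussian-kernel (Nadaraya-Watson) smoother of targets z on x,
    standing in for a weighted scatterplot smoother; evaluated at the data."""
    d = (x[:, None] - x[None, :]) / bandwidth
    K = np.exp(-0.5 * d ** 2) * w[None, :]
    return (K @ z) / K.sum(axis=1)

def weighted_backfit(X, z, w, f, n_cycles=10):
    """Weighted backfitting of an additive model to the working targets z."""
    for _ in range(n_cycles):
        alpha = np.average(z - f.sum(axis=1), weights=w)
        for j in range(X.shape[1]):
            partial = z - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = wkernel_smooth(X[:, j], partial, w)
            f[:, j] -= np.average(f[:, j], weights=w)   # keep each f_j centered
    return alpha, f

def local_scoring_logistic(X, y, n_steps=10):
    """Local scoring (Algorithm 9.2) for the additive logistic regression model."""
    N, p = X.shape
    ybar = y.mean()
    alpha = np.log(ybar / (1 - ybar))        # step 1: starting values
    f = np.zeros((N, p))
    for _ in range(n_steps):                 # step 2, repeated a fixed number of times
        eta = alpha + f.sum(axis=1)
        prob = np.clip(1 / (1 + np.exp(-eta)), 1e-6, 1 - 1e-6)  # p_i, kept away from 0/1
        z = eta + (y - prob) / (prob * (1 - prob))              # (a) working targets
        w = prob * (1 - prob)                                   # (b) weights
        alpha, f = weighted_backfit(X, z, w, f)                 # (c) weighted backfitting
    return alpha, f

# Illustrative use on simulated binary data with a nonlinear effect in X1.
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(400, 2))
p_true = 1 / (1 + np.exp(-(np.sin(2 * X[:, 0]) + 0.5 * X[:, 1])))
y = rng.binomial(1, p_true)
alpha_hat, f_hat = local_scoring_logistic(X, y)
```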
The additive logistic regression model can be generalized further to handle more than two classes, using the multilogit formulation as outlined in Section 4.4. While the formulation is a straightforward extension of (9.8), the algorithms for fitting such models are more complex. See Yee and Wild (1996) for details, and the VGAM software currently available from http://www.stat.auckland.ac.nz/∼yee.

Example: Predicting Email Spam

We apply a generalized additive model to the spam data introduced in Chapter 1. The data consists of information from 4601 email messages, in a study to screen email for “spam” (i.e., junk email). The data is publicly available at ftp.ics.uci.edu, and was donated by George Forman from Hewlett-Packard laboratories, Palo Alto, California.

The response variable is binary, with values email or spam, and there are 57 predictors as described below:

• 48 quantitative predictors: the percentage of words in the email that match a given word.