FIGURE 9.7. Illustration of PRIM algorithm. There are two classes, indicated by the blue (class 0) and red (class 1) points. The procedure starts with a rectangle (broken black lines) surrounding all of the data, and then peels away points along one edge by a prespecified amount in order to maximize the mean of the points remaining in the box. Starting at the top left panel, the sequence of peelings is shown, until a pure red region is isolated in the bottom right panel. The iteration number is indicated at the top of each panel.

FIGURE 9.8. Box mean as a function of number of observations in the box.

Algorithm 9.3 Patient Rule Induction Method.
1. Start with all of the training data, and a maximal box containing all of the data.

2. Consider shrinking the box by compressing one face, so as to peel off the proportion α of observations having either the highest values of a predictor Xj, or the lowest. Choose the peeling that produces the highest response mean in the remaining box. (Typically α = 0.05 or 0.10.)

3. Repeat step 2 until some minimal number of observations (say 10) remain in the box.

4. Expand the box along any face, as long as the resulting box mean increases.

5. Steps 1–4 give a sequence of boxes, with different numbers of observations in each box. Use cross-validation to choose a member of the sequence. Call the box B1.

6. Remove the data in box B1 from the dataset and repeat steps 2–5 to obtain a second box, and continue to get as many boxes as desired.

PRIM is designed for regression (quantitative response variable); a two-class outcome can be handled simply by coding it as 0 and 1. There is no simple way to deal with k > 2 classes simultaneously: one approach is to run PRIM separately for each class versus a baseline class.
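To make the top-down peeling of Algorithm 9.3 concrete, here is a minimal Python sketch, assuming a NumPy matrix X of predictors and vector y of responses; the names peel_once and prim_box are illustrative, not from the book, and the sketch covers only the peeling (steps 2–3), omitting the pasting step 4 and the cross-validation of step 5.

import numpy as np

def peel_once(X, y, in_box, alpha=0.10):
    # One peeling step (step 2): try removing the fraction alpha of in-box points
    # with the lowest or highest values of each predictor, and keep the peel that
    # leaves the highest mean response in the box.
    best_mask, best_mean = in_box, -np.inf
    idx = np.where(in_box)[0]
    n_peel = max(1, int(np.floor(alpha * idx.size)))
    for j in range(X.shape[1]):
        order = idx[np.argsort(X[idx, j])]
        for kept in (order[n_peel:], order[:-n_peel]):   # peel low end or high end of X_j
            mask = np.zeros_like(in_box)
            mask[kept] = True
            m = y[mask].mean()
            if m > best_mean:
                best_mask, best_mean = mask, m
    return best_mask

def prim_box(X, y, alpha=0.10, min_obs=10):
    # Repeat the peeling until roughly min_obs observations remain (step 3).
    in_box = np.ones(len(y), dtype=bool)
    while in_box.sum() * (1 - alpha) >= min_obs:
        in_box = peel_once(X, y, in_box, alpha)
    return in_box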
An advantage of PRIM over CART is its patience. Because of its binary splits, CART fragments the data quite quickly. Assuming splits of equal size, with N observations it can only make log2(N) − 1 splits before running out of data. If PRIM peels off a proportion α of training points at each stage, it can perform approximately −log(N)/log(1 − α) peeling steps before running out of data. For example, if N = 128 and α = 0.10, then log2(N) − 1 = 6 while −log(N)/log(1 − α) ≈ 46. Taking into account that there must be an integer number of observations at each stage, PRIM in fact can peel only 29 times. In any case, the ability of PRIM to be more patient should help the top-down greedy algorithm find a better solution.

9.3.1 Spam Example (Continued)

We applied PRIM to the spam data, with the response coded as 1 for spam and 0 for email. The first two boxes found by PRIM are summarized below:

                Global Mean   Box Mean   Box Support
  Rule 1
    Training      0.3931       0.9607      0.1413
    Test          0.3958       1.0000      0.1536

  Rule 1:   ch! > 0.029
            CAPAVE > 2.331
            your > 0.705
            1999 < 0.040
            CAPTOT > 79.50
            edu < 0.070
            re < 0.535
            ch; < 0.030

                Remain Mean   Box Mean   Box Support
  Rule 2
    Training      0.2998       0.9560      0.1043
    Test          0.2862       0.9264      0.1061

  Rule 2:   remove > 0.010
            george < 0.110

The box support is the proportion of observations falling in the box. The first box is purely spam, and contains about 15% of the test data. The second box contains 10.6% of the test observations, 92.6% of which are spam.
Together the two boxes contain 26% of the data and are about 97% spam. The next few boxes (not shown) are quite small, containing only about 3% of the data.

The predictors are listed in order of importance. Interestingly, the top splitting variables in the CART tree (Figure 9.5) do not appear in PRIM's first box.

9.4 MARS: Multivariate Adaptive Regression Splines

MARS is an adaptive procedure for regression, and is well suited for high-dimensional problems (i.e., a large number of inputs). It can be viewed as a generalization of stepwise linear regression or a modification of the CART method to improve the latter's performance in the regression setting.
We introduce MARS from the first point of view, and later make the connection to CART.

MARS uses expansions in piecewise linear basis functions of the form (x − t)+ and (t − x)+. The "+" means positive part, so

(x - t)_+ = \begin{cases} x - t, & \text{if } x > t, \\ 0, & \text{otherwise,} \end{cases}
\qquad\text{and}\qquad
(t - x)_+ = \begin{cases} t - x, & \text{if } x < t, \\ 0, & \text{otherwise.} \end{cases}

FIGURE 9.9. The basis functions (x − t)+ (solid orange) and (t − x)+ (broken blue) used by MARS.

As an example, the functions (x − 0.5)+ and (0.5 − x)+ are shown in Figure 9.9. Each function is piecewise linear, with a knot at the value t. In the terminology of Chapter 5, these are linear splines. We call the two functions a reflected pair in the discussion below. The idea is to form reflected pairs for each input Xj with knots at each observed value xij of that input. Therefore, the collection of basis functions is

\mathcal{C} = \bigl\{ (X_j - t)_+,\ (t - X_j)_+ \bigr\}, \quad t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\}, \; j = 1, 2, \ldots, p. \tag{9.18}
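As an illustration of the reflected pairs in (9.18), the short Python sketch below builds the collection C for a small data matrix; the names hinge and reflected_pair_basis are my own, not from the text. With all input values distinct it yields 2Np basis functions, each depending on a single coordinate but defined on all of IR^p.

import numpy as np

def hinge(x, t, sign):
    # (x - t)_+ if sign = +1, (t - x)_+ if sign = -1 (positive-part linear spline).
    return np.maximum(sign * (x - t), 0.0)

def reflected_pair_basis(X):
    # The collection C of (9.18): a reflected pair for every predictor j and knot x_ij.
    # Returns a list of callables h such that h(Z) evaluates the basis function on rows of Z.
    N, p = X.shape
    basis = []
    for j in range(p):
        for t in np.unique(X[:, j]):                                  # knots at observed values of X_j
            basis.append(lambda Z, j=j, t=t: hinge(Z[:, j], t, +1))   # (X_j - t)_+
            basis.append(lambda Z, j=j, t=t: hinge(Z[:, j], t, -1))   # (t - X_j)_+
    return basis

# Example: N = 5 distinct values per predictor and p = 2 gives 2 * 5 * 2 = 20 functions.
X = np.random.default_rng(0).uniform(size=(5, 2))
print(len(reflected_pair_basis(X)))   # 20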
If all of the input values are distinct, there are 2Np basis functions altogether. Note that although each basis function depends only on a single Xj, for example, h(X) = (Xj − t)+, it is considered as a function over the entire input space IRp.

The model-building strategy is like a forward stepwise linear regression, but instead of using the original inputs, we are allowed to use functions from the set C and their products. Thus the model has the form

f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X), \tag{9.19}

where each hm(X) is a function in C, or a product of two or more such functions. Given a choice for the hm, the coefficients βm are estimated by minimizing the residual sum-of-squares, that is, by standard linear regression.
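For a fixed set of basis functions hm, fitting the coefficients in (9.19) is ordinary least squares on the transformed inputs. A minimal sketch, assuming a list of basis functions such as the output of the hypothetical reflected_pair_basis above; fit_coefficients and predict are my own names.

import numpy as np

def fit_coefficients(basis, X, y):
    # Least-squares estimates of beta_0, ..., beta_M in (9.19) for fixed h_1, ..., h_M.
    H = np.column_stack([np.ones(len(y))] + [h(X) for h in basis])  # design matrix [1, h_1(X), ..., h_M(X)]
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)                    # minimizes the residual sum-of-squares
    return beta

def predict(beta, basis, X):
    # Evaluate f(X) = beta_0 + sum_m beta_m h_m(X).
    H = np.column_stack([np.ones(len(X))] + [h(X) for h in basis])
    return H @ beta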
The real art, however, is in the construction of the functions hm(x). We start with only the constant function h0(X) = 1 in our model, and all functions in the set C are candidate functions. This is depicted in Figure 9.10.

At each stage we consider as a new basis function pair all products of a function hm in the model set M with one of the reflected pairs in C. We add to the model M the term of the form

β̂M+1 hℓ(X) · (Xj − t)+ + β̂M+2 hℓ(X) · (t − Xj)+,   hℓ ∈ M,

that produces the largest decrease in training error. Here β̂M+1 and β̂M+2 are coefficients estimated by least squares, along with all the other M + 1 coefficients in the model.

FIGURE 9.10. Schematic of the MARS forward model-building procedure. On the left are the basis functions currently in the model: initially, this is the constant function h(X) = 1. On the right are all candidate basis functions to be considered in building the model. These are pairs of piecewise linear basis functions as in Figure 9.9, with knots t at all unique observed values xij of each predictor Xj. At each stage we consider all products of a candidate pair with a basis function in the model. The product that decreases the residual error the most is added into the current model. Above we illustrate the first three steps of the procedure, with the selected functions shown in red.

FIGURE 9.11. The function h(X1, X2) = (X1 − x51)+ · (x72 − X2)+, resulting from multiplication of two piecewise linear MARS basis functions.
Then the winning products are added to the model and the process is continued until the model set M contains some preset maximum number of terms.

For example, at the first stage we consider adding to the model a function of the form β1(Xj − t)+ + β2(t − Xj)+, t ∈ {xij}, since multiplication by the constant function just produces the function itself. Suppose the best choice is β̂1(X2 − x72)+ + β̂2(x72 − X2)+. Then this pair of basis functions is added to the set M, and at the next stage we consider including a pair of products of the form

hm(X) · (Xj − t)+   and   hm(X) · (t − Xj)+,   t ∈ {xij},

where for hm we have the choices

h0(X) = 1,
h1(X) = (X2 − x72)+, or
h2(X) = (x72 − X2)+.

The third choice produces functions such as (X1 − x51)+ · (x72 − X2)+, depicted in Figure 9.11. At the end of this process we have a large model of the form (9.19).
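The forward step just described can be sketched in a few lines of Python, continuing the hypothetical helpers hinge, fit_coefficients, and predict from the earlier snippets; forward_step is my own name, and the efficiency tricks and input-reuse restrictions of the actual MARS algorithm are ignored. For every function already in the model, every predictor, and every knot, it tries adding the reflected-pair product and keeps the pair giving the smallest residual sum-of-squares.

import numpy as np

def forward_step(model, X, y):
    # Add the reflected-pair product that most decreases the training RSS.
    # model: list of basis functions currently in the model (the constant h_0 is implicit).
    candidates = [lambda Z: np.ones(len(Z))] + model        # h_0(X) = 1 plus current terms
    best_rss, best_pair = np.inf, None
    for hm in candidates:
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                pair = [lambda Z, hm=hm, j=j, t=t: hm(Z) * hinge(Z[:, j], t, +1),
                        lambda Z, hm=hm, j=j, t=t: hm(Z) * hinge(Z[:, j], t, -1)]
                trial = model + pair
                beta = fit_coefficients(trial, X, y)        # refit all coefficients by least squares
                rss = np.sum((y - predict(beta, trial, X)) ** 2)
                if rss < best_rss:
                    best_rss, best_pair = rss, pair
    return model + best_pair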
This model typically overfits the data, and so a backward deletion procedure is applied. The term whose removal causes the smallest increase in residual squared error is deleted from the model at each stage, producing an estimated best model f̂λ of each size (number of terms) λ. One could use cross-validation to estimate the optimal value of λ, but for computational savings the MARS procedure instead uses generalized cross-validation.
This criterion is defined as

\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \bigl(y_i - \hat{f}_\lambda(x_i)\bigr)^2}{\bigl(1 - M(\lambda)/N\bigr)^2}. \tag{9.20}

The value M(λ) is the effective number of parameters in the model: this accounts both for the number of terms in the model, plus the number of parameters used in selecting the optimal positions of the knots. Some mathematical and simulation results suggest that one should pay a price of three parameters for selecting a knot in a piecewise linear regression. Thus if there are r linearly independent basis functions in the model, and K knots were selected in the forward process, the formula is M(λ) = r + cK, where c = 3.
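As a sketch of how (9.20) might be computed along the backward-deletion sequence: the helper names gcv and backward_delete are mine, not part of any MARS software; I reuse the hypothetical fit_coefficients and predict from the earlier snippets, charge c = 3 parameters per knot, and track the number of knots per term with a simple list.

import numpy as np

def gcv(y, fitted, r, K, c=3.0):
    # Generalized cross-validation criterion (9.20), with M(lambda) = r + c*K.
    N = len(y)
    M_lambda = r + c * K
    rss = np.sum((y - fitted) ** 2)
    return rss / (1.0 - M_lambda / N) ** 2

def backward_delete(model, knots_per_term, X, y):
    # Delete one term at a time, recording the GCV of each model size.
    results = []
    while model:
        beta = fit_coefficients(model, X, y)
        fitted = predict(beta, model, X)
        K = sum(knots_per_term)                       # knots used by the current terms
        results.append((len(model), gcv(y, fitted, len(model) + 1, K)))
        # Drop the term whose removal increases the residual sum-of-squares the least.
        rss_without = []
        for i in range(len(model)):
            trial = model[:i] + model[i + 1:]
            b = fit_coefficients(trial, X, y)
            rss_without.append(np.sum((y - predict(b, trial, X)) ** 2))
        i_best = int(np.argmin(rss_without))
        model = model[:i_best] + model[i_best + 1:]
        knots_per_term = knots_per_term[:i_best] + knots_per_term[i_best + 1:]
    return results                                    # (size, GCV) pairs; choose the size minimizing GCV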