(When the model is restricted to be additive—details below—a penalty of c = 2 is used). Using this, we choose the model along the backward sequence that minimizes GCV(λ).
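To make the selection step concrete, here is a small sketch in Python (with invented numbers) that scores a backward-deletion sequence using a GCV criterion of the form defined just before this passage, RSS/(1 - M(λ)/N)^2 with effective parameter count M(λ) = r + cK, taking c = 3 as the default penalty (the passage above notes c = 2 when the model is restricted to be additive), and keeps the minimizer.

    import numpy as np

    def gcv(rss, n_obs, n_basis, n_knots, c=3.0):
        # GCV(lambda) = RSS / (1 - M(lambda)/N)^2, with M(lambda) = r + c*K;
        # c = 2 when the model is restricted to be additive.
        m_lambda = n_basis + c * n_knots
        return rss / (1.0 - m_lambda / n_obs) ** 2

    # Hypothetical backward-deletion sequence: (RSS, r, K) for each candidate model.
    N = 300
    backward_sequence = [(520.0, 21, 10), (523.0, 19, 9), (548.0, 13, 6), (700.0, 5, 2)]
    scores = [gcv(rss, N, r, K) for rss, r, K in backward_sequence]
    best_model = int(np.argmin(scores))   # model along the sequence minimizing GCV(lambda)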
Why these piecewise linear basis functions, and why this particular model strategy? A key property of the functions of Figure 9.9 is their ability to operate locally; they are zero over part of their range. When they are multiplied together, as in Figure 9.11, the result is nonzero only over the small part of the feature space where both component functions are nonzero. As a result, the regression surface is built up parsimoniously, using nonzero components locally—only where they are needed. This is important, since one should “spend” parameters carefully in high dimensions, as they can run out quickly. The use of other basis functions, such as polynomials, would produce a nonzero product everywhere, and would not work as well.
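The local support of these functions, and of their products, is easy to see numerically. The following minimal sketch (with made-up knots) forms a reflected pair of hinge functions and the tensor product of two such factors; the product is nonzero only on the corner of the plane where both factors are positive.

    import numpy as np

    def reflected_pair(x, t):
        # The reflected pair of piecewise linear basis functions with knot t.
        return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

    rng = np.random.default_rng(0)
    X1, X2 = rng.uniform(-2, 2, size=(2, 10000))

    h1, _ = reflected_pair(X1, 0.5)    # (X1 - 0.5)_+
    h2, _ = reflected_pair(X2, -1.0)   # (X2 + 1)_+
    tensor = h1 * h2                   # nonzero only where both factors are nonzero

    # Fraction of the plane on which each function contributes:
    print(np.mean(h1 > 0), np.mean(h2 > 0), np.mean(tensor > 0))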
The second important advantage of the piecewise linear basis function concerns computation. Consider the product of a function in M with each of the N reflected pairs for an input X_j. This appears to require the fitting of N single-input linear regression models, each of which uses O(N) operations, making a total of O(N^2) operations.
However, we can exploit the simple form of the piecewise linear function. We first fit the reflected pair with rightmost knot. As the knot is moved successively one position at a time to the left, the basis functions differ by zero over the left part of the domain, and by a constant over the right part. Hence after each such move we can update the fit in O(1) operations. This allows us to try every knot in only O(N) operations.
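The sketch below (simulated data, right hinge only) checks this updating idea numerically: as the knot moves one sorted data value to the left, the inner product of the basis function with y changes by the knot spacing times a running suffix sum of y, so each move costs O(1). The same running-sum trick applies to the other sufficient statistics needed for the least-squares fit.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.normal(size=200))          # sorted inputs; knots sit at the data values
    y = np.sin(x) + 0.1 * rng.normal(size=200)

    def hinge(x, t):
        # Right hinge (x - t)_+; the mirrored basis (t - x)_+ is handled analogously.
        return np.maximum(x - t, 0.0)

    N = len(x)
    ip = float(np.dot(hinge(x, x[N - 1]), y))  # <b_t, y> for the rightmost knot: one O(N) pass
    suffix_sum = y[N - 1]                      # sum of y_i over points at or beyond the current knot

    for k in range(N - 2, -1, -1):
        delta = x[k + 1] - x[k]                # the knot moves left by delta
        ip += delta * suffix_sum               # O(1): basis changes by delta on the right, by 0 on the left
        suffix_sum += y[k]
        assert np.isclose(ip, np.dot(hinge(x, x[k]), y))   # agrees with the O(N) recomputation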
The forward modeling strategy in MARS is hierarchical, in the sense that multiway products are built up from products involving terms already in the model. For example, a four-way product can only be added to the model if one of its three-way components is already in the model. The philosophy here is that a high-order interaction will likely only exist if some of its lower-order “footprints” exist as well. This need not be true, but is a reasonable working assumption and avoids the search over an exponentially growing space of alternatives.

FIGURE 9.12. Spam data: test error misclassification rate for the MARS procedure, as a function of the rank (number of independent basis functions) in the model.

There is one restriction put on the formation of model terms: each input can appear at most once in a product.
This prevents the formation of higher-order powers of an input, which increase or decrease too sharply near the boundaries of the feature space. Such powers can be approximated in a more stable way with piecewise linear functions.

A useful option in the MARS procedure is to set an upper limit on the order of interaction. For example, one can set a limit of two, allowing pairwise products of piecewise linear functions, but not three- or higher-way products. This can aid in the interpretation of the final model. An upper limit of one results in an additive model.
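A schematic way to see how these restrictions shape the forward search: represent each model term by the set of inputs it involves, and generate candidates by multiplying an existing term by a reflected pair in a new input, enforcing the hierarchy, the one-appearance-per-input rule, and the optional cap on the interaction order. The sketch below covers only this bookkeeping, not the full fitting procedure.

    def candidate_terms(model_terms, n_inputs, max_degree=2):
        # model_terms: sets of input indices for terms already in the model
        # (the constant term is the empty set).  A candidate multiplies one of
        # these terms by a reflected pair in an input it does not yet contain.
        candidates = []
        for term in model_terms:                  # hierarchy: build only on existing terms
            for j in range(n_inputs):
                if j in term:                     # each input may appear at most once
                    continue
                if len(term) + 1 > max_degree:    # optional limit on the interaction order
                    continue
                candidates.append(term | {j})
        return candidates

    # Example: constant term plus main effects in inputs 0 and 3 are in the model.
    model = [frozenset(), frozenset({0}), frozenset({3})]
    print(candidate_terms(model, n_inputs=5, max_degree=2))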
9.4.1 Spam Example (Continued)

We applied MARS to the “spam” data analyzed earlier in this chapter. To enhance interpretability, we restricted MARS to second-degree interactions. Although the target is a two-class variable, we used the squared-error loss function nonetheless (see Section 9.4.3). Figure 9.12 shows the test error misclassification rate as a function of the rank (number of independent basis functions) in the model. The error rate levels off at about 5.5%, which is slightly higher than that of the generalized additive model (5.3%) discussed earlier. GCV chose a model size of 60, which is roughly the smallest model giving optimal performance.
The leading interactions found by MARS involved inputs (ch$, remove), (ch$, free) and (hp, CAPTOT). However, these interactions give no improvement in performance over the generalized additive model.

9.4.2 Example (Simulated Data)

Here we examine the performance of MARS in three contrasting scenarios. There are N = 100 observations, and the predictors X_1, X_2, ..., X_p and errors ε have independent standard normal distributions.

Scenario 1: The data generation model is

    Y = (X_1 - 1)_+ + (X_1 - 1)_+ · (X_2 - 0.8)_+ + 0.12 · ε.        (9.21)

The noise standard deviation 0.12 was chosen so that the signal-to-noise ratio was about 5. We call this the tensor-product scenario; the product term gives a surface that looks like that of Figure 9.11.

Scenario 2: This is the same as scenario 1, but with p = 20 total predictors; that is, there are 18 inputs that are independent of the response.

Scenario 3: This has the structure of a neural network:

    ℓ_1 = X_1 + X_2 + X_3 + X_4 + X_5,
    ℓ_2 = X_6 - X_7 + X_8 - X_9 + X_10,
    σ(t) = 1/(1 + e^{-t}),
    Y = σ(ℓ_1) + σ(ℓ_2) + 0.12 · ε.                                  (9.22)
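The three data-generating models translate directly into code. The sketch below writes them out; scenario 3 is shown with ten inputs, the number actually appearing in ℓ_1 and ℓ_2 (the text does not state p for that scenario, so that choice is an assumption here).

    import numpy as np

    def simulate(scenario, n=100, p=2, sigma=0.12, seed=0):
        # Data-generating models (9.21)-(9.22); scenario 2 is scenario 1 with p = 20.
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((n, p))
        eps = rng.standard_normal(n)
        pos = lambda z: np.maximum(z, 0.0)
        if scenario in (1, 2):
            mu = pos(X[:, 0] - 1) + pos(X[:, 0] - 1) * pos(X[:, 1] - 0.8)
        else:
            sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
            l1 = X[:, 0] + X[:, 1] + X[:, 2] + X[:, 3] + X[:, 4]
            l2 = X[:, 5] - X[:, 6] + X[:, 7] - X[:, 8] + X[:, 9]
            mu = sigmoid(l1) + sigmoid(l2)
        return X, mu + sigma * eps, mu            # inputs, response, true mean mu(x)

    X1_, y1, mu1 = simulate(1, p=2)               # tensor-product scenario
    X2_, y2, mu2 = simulate(2, p=20)              # 18 inputs independent of the response
    X3_, y3, mu3 = simulate(3, p=10)              # neural-network scenario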
Scenarios 1 and 2 are ideally suited for MARS, while scenario 3 contains high-order interactions and may be difficult for MARS to approximate. We ran five simulations from each model, and recorded the results.

In scenario 1, MARS typically uncovered the correct model almost perfectly. In scenario 2, it found the correct structure but also found a few extraneous terms involving other predictors.

Let µ(x) be the true mean of Y, and let

    MSE_0 = ave_{x ∈ Test} (\bar{y} - µ(x))^2,
    MSE   = ave_{x ∈ Test} (\hat{f}(x) - µ(x))^2.        (9.23)

These represent the mean-square error of the constant model and the fitted MARS model, estimated by averaging at the 1000 test values of x.
Table 9.4 shows the proportional decrease in model error, or R^2, for each scenario:

    R^2 = (MSE_0 - MSE) / MSE_0.        (9.24)

The values shown are means and standard errors over the five simulations. The performance of MARS is degraded only slightly by the inclusion of the useless inputs in scenario 2; it performs substantially worse in scenario 3.

TABLE 9.4. Proportional decrease in model error (R^2) when MARS is applied to three different scenarios.

    Scenario                     Mean    (S.E.)
    1: Tensor product p = 2      0.97    (0.01)
    2: Tensor product p = 20     0.96    (0.01)
    3: Neural network            0.79    (0.01)
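Equations (9.23) and (9.24) are straightforward to compute for any fitted model. The sketch below assumes the constant model uses the training-sample mean \bar{y}; the prediction routine is left as a hypothetical placeholder, since no particular MARS implementation is assumed here.

    import numpy as np

    def proportional_decrease_in_error(y_train, mu_test, fhat_test):
        # Eqs. (9.23)-(9.24): both mean-squared errors are measured against the
        # true mean mu(x), averaged over the test inputs.
        mse0 = np.mean((np.mean(y_train) - mu_test) ** 2)   # constant model
        mse = np.mean((fhat_test - mu_test) ** 2)           # fitted model
        return (mse0 - mse) / mse0

    # Hypothetical usage, e.g. with the simulate() sketch above and some fitted model:
    #   X_test, _, mu_test = simulate(1, n=1000, p=2)
    #   r2 = proportional_decrease_in_error(y1, mu_test, fitted.predict(X_test))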
9.4.3 Other Issues

MARS for Classification

The MARS method and algorithm can be extended to handle classification problems. Several strategies have been suggested.

For two classes, one can code the output as 0/1 and treat the problem as a regression; we did this for the spam example. For more than two classes, one can use the indicator response approach described in Section 4.2. One codes the K response classes via 0/1 indicator variables, and then performs a multi-response MARS regression. For the latter we use a common set of basis functions for all response variables. Classification is made to the class with the largest predicted response value. There are, however, potential masking problems with this approach, as described in Section 4.2. A generally superior approach is the “optimal scoring” method discussed in Section 12.5.
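For the multiclass recipe just described, only two small pieces are needed beyond the regression fit itself: the 0/1 indicator coding of the response and the arg-max classification rule. The sketch below shows both; fit_multiresponse stands in for whatever multi-response MARS routine is used and is purely hypothetical.

    import numpy as np

    def indicator_response(labels, classes):
        # N x K matrix of 0/1 indicators coding the K response classes.
        labels = np.asarray(labels)
        classes = np.asarray(classes)
        return (labels[:, None] == classes[None, :]).astype(float)

    def classify(predictions, classes):
        # Assign each observation to the class with the largest predicted response.
        return np.asarray(classes)[np.argmax(predictions, axis=1)]

    # Workflow sketch (fit_multiresponse is a hypothetical multi-response MARS fitter
    # using a common set of basis functions for all K responses):
    #   Y = indicator_response(g_train, classes)
    #   fit = fit_multiresponse(X_train, Y)
    #   g_hat = classify(fit.predict(X_test), classes)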
Stone et al. (1997) developed a hybrid of MARS called PolyMARS specifically designed to handle classification problems. It uses the multiple logistic framework described in Section 4.4. It grows the model in a forward stagewise fashion like MARS, but at each stage uses a quadratic approximation to the multinomial log-likelihood to search for the next basis-function pair. Once found, the enlarged model is fit by maximum likelihood, and the process is repeated.

Relationship of MARS to CART

Although they might seem quite different, the MARS and CART strategies actually have strong similarities.
Suppose we take the MARS procedure and make the following changes:

• Replace the piecewise linear basis functions by step functions I(x − t > 0) and I(x − t ≤ 0).

• When a model term is involved in a multiplication by a candidate term, it gets replaced by the interaction, and hence is not available for further interactions.

With these changes, the MARS forward procedure is the same as the CART tree-growing algorithm.
Multiplying a step function by a pair of reflected step functions is equivalent to splitting a node at the step. The second restriction implies that a node may not be split more than once, and leads to the attractive binary-tree representation of the CART model. On the other hand, it is this restriction that makes it difficult for CART to model additive structures. MARS forgoes the tree structure and gains the ability to capture additive effects.

Mixed Inputs

MARS can handle “mixed” predictors—quantitative and qualitative—in a natural way, much like CART does. MARS considers all possible binary partitions of the categories for a qualitative predictor into two groups. Each such partition generates a pair of piecewise constant basis functions—indicator functions for the two sets of categories. This basis pair is now treated as any other, and is used in forming tensor products with other basis functions already in the model.
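The sketch below enumerates the binary partitions of a qualitative predictor's levels and forms the corresponding pair of indicator basis functions; the four-level factor is made up for illustration.

    from itertools import combinations
    import numpy as np

    def binary_partitions(levels):
        # All splits of the category levels into two nonempty groups; fixing the
        # first level in group A avoids listing each partition twice.
        levels = list(levels)
        first, rest = levels[0], levels[1:]
        for r in range(len(rest) + 1):
            for subset in combinations(rest, r):
                A = {first, *subset}
                B = set(levels) - A
                if B:
                    yield A, B

    def partition_basis_pair(x_cat, group_a):
        # Pair of piecewise constant (indicator) basis functions for one partition.
        in_a = np.isin(x_cat, list(group_a)).astype(float)
        return in_a, 1.0 - in_a

    x = np.array(["a", "b", "c", "a", "d", "b"])          # hypothetical factor
    pairs = [partition_basis_pair(x, A) for A, B in binary_partitions("abcd")]
    print(len(pairs))                                     # 2^(4-1) - 1 = 7 partitions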
9.5 Hierarchical Mixtures of Experts

The hierarchical mixtures of experts (HME) procedure can be viewed as a variant of tree-based methods. The main difference is that the tree splits are not hard decisions but rather soft probabilistic ones. At each node an observation goes left or right with probabilities depending on its input values.
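As a small preview of how such a soft split can be parameterized, the sketch below computes left/right probabilities as a softmax of linear functions of the inputs, the usual form for HME gating networks; the coefficients here are invented for illustration.

    import numpy as np

    def gating_probabilities(x, gamma_left, gamma_right):
        # Soft split at a node: P(go left | x) and P(go right | x) via a softmax
        # of linear functions of the inputs.
        scores = np.array([gamma_left @ x, gamma_right @ x])
        scores -= scores.max()                  # for numerical stability
        w = np.exp(scores)
        return w / w.sum()

    x = np.array([1.0, 0.3, -0.2])              # one observation's inputs (made up)
    p_left, p_right = gating_probabilities(x,
                                           gamma_left=np.array([0.5, -1.0, 2.0]),
                                           gamma_right=np.zeros(3))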