Often a “one-standard-error” rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model. Here it looks like a model with about p = 9 predictors would be chosen, while the true model uses p = 10.
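As a small illustrative sketch (not from the text; Python), the rule can be applied to cross-validation results as follows. Here cv_mean, cv_se, and complexity are hypothetical arrays giving, for each candidate model, its estimated CV error, the standard error of that estimate, and a complexity measure such as the number of predictors p:

```python
import numpy as np

def one_se_rule(cv_mean, cv_se, complexity):
    """Return the index of the most parsimonious model whose CV error
    is within one standard error of the best (minimum-error) model."""
    cv_mean = np.asarray(cv_mean)
    cv_se = np.asarray(cv_se)
    complexity = np.asarray(complexity)
    best = np.argmin(cv_mean)                        # model with smallest CV error
    threshold = cv_mean[best] + cv_se[best]          # one-SE band above the best
    eligible = np.flatnonzero(cv_mean <= threshold)  # models inside the band
    return eligible[np.argmin(complexity[eligible])] # least complex of those
```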
Generalized cross-validation provides a convenient approximation to leave-one-out cross-validation, for linear fitting under squared-error loss. As defined in Section 7.6, a linear fitting method is one for which we can write

$$\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}. \qquad (7.50)$$

Now for many linear fitting methods,

$$\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i - \hat{f}^{-i}(x_i)\bigr]^2 = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat{f}(x_i)}{1 - S_{ii}}\right]^2, \qquad (7.51)$$

where $S_{ii}$ is the $i$th diagonal element of $\mathbf{S}$ (see Exercise 7.3). The GCV approximation is

$$\mathrm{GCV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat{f}(x_i)}{1 - \mathrm{trace}(\mathbf{S})/N}\right]^2. \qquad (7.52)$$

The quantity $\mathrm{trace}(\mathbf{S})$ is the effective number of parameters, as defined in Section 7.6.

GCV can have a computational advantage in some settings, where the trace of $\mathbf{S}$ can be computed more easily than the individual elements $S_{ii}$. In smoothing problems, GCV can also alleviate the tendency of cross-validation to undersmooth. The similarity between GCV and AIC can be seen from the approximation $1/(1-x)^2 \approx 1 + 2x$ (Exercise 7.7).
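As a quick numerical sketch (not from the text; Python), one can check identity (7.51) and compute the GCV approximation (7.52) for ordinary least squares, a linear smoother for which (7.51) holds exactly, with $\mathbf{S}$ the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5
X = rng.standard_normal((N, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

# Linear smoother: ordinary least squares, y_hat = S y with S = X (X'X)^{-1} X'
S = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = S @ y

# Brute-force leave-one-out cross-validation
loo = np.empty(N)
for i in range(N):
    mask = np.arange(N) != i
    beta_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    loo[i] = (y[i] - X[i] @ beta_i) ** 2
loo_cv = loo.mean()

# Shortcut (7.51) using the diagonal of S, and the GCV approximation (7.52)
shortcut = np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)
gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / N)) ** 2)

print(loo_cv, shortcut, gcv)
```

The brute-force leave-one-out estimate and the shortcut (7.51) agree, while GCV replaces each $S_{ii}$ by the average value $\mathrm{trace}(\mathbf{S})/N$.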
7.10.2 The Wrong and Right Way to Do Cross-validation

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:

1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels.

2. Using just this subset of predictors, build a multivariate classifier.

3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

Is this a correct application of cross-validation? Consider a scenario with N = 50 samples in two equal-sized classes, and p = 5000 quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%.
This is far lower than the true error rate of 50%.

What has happened? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left-out samples.

Figure 7.10 (top panel) illustrates the problem. We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples (top panel).
We see that the correlations average about 0.28, rather than 0, as one might expect.

FIGURE 7.10. Cross-validation the wrong and right way: histograms show the correlation of class labels, in 10 randomly chosen samples, with the 100 predictors chosen using the incorrect (upper red) and correct (lower green) versions of cross-validation. [Panels: “Wrong way” (top) and “Right way” (bottom); x-axis: correlations of selected predictors with outcome; y-axis: frequency.]

Here is the correct way to carry out cross-validation in this example:

1. Divide the samples into K cross-validation folds (groups) at random.

2. For each fold k = 1, 2, . . . , K:

   (a) Find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels, using all of the samples except those in fold k.

   (b) Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold k.

   (c) Use the classifier to predict the class labels for the samples in fold k.

The error estimates from step 2(c) are then accumulated over all K folds, to produce the cross-validation estimate of prediction error.
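The following is a minimal simulation sketch (not from the text; Python/NumPy, with illustrative helper names) contrasting the two procedures in the scenario described above: N = 50 samples, p = 5000 independent Gaussian predictors, screening of the 100 predictors most correlated with the labels, and a 1-nearest-neighbor classifier:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, n_keep, K = 50, 5000, 100, 5
X = rng.standard_normal((N, p))
y = np.repeat([0, 1], N // 2)          # labels independent of X: true error is 50%

def screen(X, y, n_keep):
    """Indices of the n_keep predictors most correlated (in absolute value) with y."""
    yc = y - y.mean()
    corr = (X - X.mean(0)).T @ yc / (X.std(0) * yc.std() * len(y))
    return np.argsort(np.abs(corr))[-n_keep:]

def nn1_errors(X_tr, y_tr, X_te, y_te):
    """Number of test errors of a 1-nearest-neighbor classifier."""
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return np.sum(y_tr[d.argmin(axis=1)] != y_te)

def cv_error(X, y, folds, screen_inside_cv):
    all_idx = np.arange(len(y))
    keep_all = screen(X, y, n_keep)          # used only by the WRONG procedure
    errors = 0
    for f in folds:
        tr = np.setdiff1d(all_idx, f)
        keep = screen(X[tr], y[tr], n_keep) if screen_inside_cv else keep_all
        errors += nn1_errors(X[tr][:, keep], y[tr], X[f][:, keep], y[f])
    return errors / len(y)

folds = np.array_split(rng.permutation(N), K)
print("wrong way:", cv_error(X, y, folds, screen_inside_cv=False))  # far too optimistic
print("right way:", cv_error(X, y, folds, screen_inside_cv=True))   # near 0.5 on average
```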
The lower panel of Figure 7.10 shows the correlations of class labels with the 100 predictors chosen in step 2(a) of the correct procedure, over the samples in a typical fold k. We see that they average about zero, as they should.

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied.
There is one qualification: initial unsupervised screening steps can be done before samples are left out. For example, we could select the 1000 predictors with highest variance across all 50 samples, before starting cross-validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.

While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top-rank journals. With the large numbers of predictors that are so common in genomic and other areas, the potential consequences of this error have also increased dramatically; see Ambroise and McLachlan (2002) for a detailed discussion of this issue.
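To make the qualification concrete, here is a small sketch (not from the text; Python, illustrative names) of an unsupervised variance filter applied once, up front, which is legitimate because it never consults the class labels:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5000))   # N = 50 samples, p = 5000 predictors
y = np.repeat([0, 1], 25)             # class labels (deliberately unused below)

# Unsupervised screening: keep the 1000 highest-variance predictors.
# This uses X only, never y, so it may be done before cross-validation begins.
keep = np.argsort(X.var(axis=0))[-1000:]
X_reduced = X[:, keep]

# Any supervised screening (e.g. by correlation with y) must instead be
# repeated inside each cross-validation fold, as in the recipe above.
```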
7.10.3 Does Cross-Validation Really Work?

We once again examine the behavior of cross-validation in a high-dimensional classification problem. Consider a scenario with N = 20 samples in two equal-sized classes, and p = 500 quantitative predictors that are independent of the class labels. Once again, the true error rate of any classifier is 50%. Consider a simple univariate classifier: a single split that minimizes the misclassification error (a “stump”). Stumps are trees with a single split, and are used in boosting methods (Chapter 10). A simple argument suggests that cross-validation will not work properly in this setting:²

    Fitting to the entire training set, we will find a predictor that splits the data very well. If we do 5-fold cross-validation, this same predictor should split any 4/5ths and 1/5th of the data well too, and hence its cross-validation error will be small (much less than 50%). Thus CV does not give an accurate estimate of error.

To investigate whether this argument is correct, Figure 7.11 shows the result of a simulation from this setting.
There are 500 predictors and 20 samples, in two equal-sized classes, with all predictors having a standard Gaussian distribution. The panel in the top left shows the number of training errors for each of the 500 stumps fit to the training data. We have marked in color the six predictors yielding the fewest errors. In the top right panel, the training errors are shown for stumps fit to a random 4/5ths partition of the data (16 samples), and tested on the remaining 1/5th (four samples). The colored points indicate the same predictors marked in the top left panel.
We see that the stump for the blue predictor (whose stump was the best in the top left panel) makes two out of four test errors (50%), and is no better than random.

FIGURE 7.11. Simulation study to investigate the performance of cross-validation in a high-dimensional problem where the predictors are independent of the class labels. The top left panel shows the number of errors made by individual stump classifiers on the full training set (20 observations). The top right panel shows the errors made by individual stumps trained on a random split of the dataset into 4/5ths (16 observations) and tested on the remaining 1/5th (4 observations). The best performers are depicted by colored dots in each panel. The bottom left panel shows the effect of re-estimating the split point in each fold: the colored points correspond to the four samples in the 4/5ths validation set. The split point derived from the full dataset classifies all four samples correctly, but when the split point is re-estimated on the 4/5ths data (as it should be), it commits two errors on the four validation samples. In the bottom right we see the overall result of five-fold cross-validation applied to 50 simulated datasets. The average error rate is about 50%, as it should be. [Axis labels from the four panels include: Error on Full Training Set, Error on 4/5, Error on 1/5, Predictor, Predictor 436 (blue), Class Label (full vs. 4/5), CV Errors.]

² This argument was made to us by a scientist at a proteomics lab meeting, and led to the material in this section.

What has happened? The preceding argument has ignored the fact that in cross-validation, the model must be completely retrained for each fold of the process. In the present example, this means that the best predictor and corresponding split point are found from 4/5ths of the data. The effect of predictor choice is seen in the top right panel. Since the class labels are independent of the predictors, the performance of a stump on the 4/5ths training data contains no information about its performance in the remaining 1/5th.
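A minimal simulation sketch of this setting (not from the text; Python, illustrative helper names), in which the best predictor and its split point are re-estimated inside every fold; over repeated simulations the cross-validation error averages about 50%:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, K = 20, 500, 5
X = rng.standard_normal((N, p))
y = np.repeat([0, 1], N // 2)            # class labels independent of the predictors

def fit_stump(X, y):
    """Exhaustively find the (predictor, threshold, orientation) with fewest training errors."""
    best = (None, None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            pred = (X[:, j] <= s).astype(int)
            for flip in (0, 1):                  # which side of the split is class 1
                err = np.sum((pred ^ flip) != y)
                if err < best[3]:
                    best = (j, s, flip, err)
    return best[:3]

def predict_stump(stump, X):
    j, s, flip = stump
    return ((X[:, j] <= s).astype(int)) ^ flip

# Five-fold cross-validation, retraining the stump (predictor AND split point) in each fold.
folds = np.array_split(rng.permutation(N), K)
errors = 0
for f in folds:
    tr = np.setdiff1d(np.arange(N), f)
    stump = fit_stump(X[tr], y[tr])
    errors += np.sum(predict_stump(stump, X[f]) != y[f])

print("CV error rate:", errors / N)      # around 0.5 on average over repeated simulations
```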