The Elements of Statistical Learning: Data Mining, Inference, and Prediction
The effect of the choice of split point is shown in the bottom left panel. Here we see the data for predictor 436, corresponding to the blue dot in the top left plot. The colored points indicate the 1/5th data, while the remaining points belong to the 4/5ths. The optimal split points for this predictor based on both the full training set and 4/5ths data are indicated. The split based on the full data makes no errors on the 1/5ths data. But cross-validation must base its split on the 4/5ths data, and this incurs two errors out of four samples.

The results of applying five-fold cross-validation to each of 50 simulated datasets is shown in the bottom right panel. As we would hope, the average cross-validation error is around 50%, which is the true expected prediction error for this classifier.
Hence cross-validation has behaved as it should. On the other hand, there is considerable variability in the error, underscoring the importance of reporting the estimated standard error of the CV estimate. See Exercise 7.10 for another variation of this problem.

7.11 Bootstrap Methods

The bootstrap is a general tool for assessing statistical accuracy. First we describe the bootstrap in general, and then show how it can be used to estimate extra-sample prediction error. As with cross-validation, the bootstrap seeks to estimate the conditional error Err_T, but typically estimates well only the expected prediction error Err.

Suppose we have a model fit to a set of training data.
We denote the training set by Z = (z1, z2, . . . , zN) where zi = (xi, yi). The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set. This is done B times (B = 100 say), producing B bootstrap datasets, as shown in Figure 7.12. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the B replications.

[FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the statistical accuracy of a quantity S(Z) computed from our dataset. B training sets Z*b, b = 1, . . . , B, each of size N, are drawn with replacement from the original dataset. The quantity of interest S(Z) is computed from each bootstrap training set, and the values S(Z*1), . . . , S(Z*B) are used to assess the statistical accuracy of S(Z).]

In the figure, S(Z) is any quantity computed from the data Z, for example, the prediction at some input point. From the bootstrap sampling we can estimate any aspect of the distribution of S(Z), for example, its variance,

    \widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} \bigl( S(Z^{*b}) - \bar{S}^{*} \bigr)^{2},        (7.53)

where S̄* = Σ_b S(Z*b)/B. Note that \widehat{\mathrm{Var}}[S(Z)] can be thought of as a Monte-Carlo estimate of the variance of S(Z) under sampling from the empirical distribution function F̂ for the data (z1, z2, . . . , zN).
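As a concrete illustration of the sampling scheme in Figure 7.12 and the variance estimate (7.53), here is a minimal sketch in Python/NumPy. The sample size, the number of replications B, and the choice of the sample median as the statistic S are arbitrary illustrative values, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Training data Z = (z_1, ..., z_N); a single numeric variable for simplicity.
    N = 100
    Z = rng.standard_normal(N)

    def S(data):
        # Quantity of interest computed from a dataset, e.g. the sample median.
        return np.median(data)

    B = 100  # number of bootstrap replications
    S_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)  # bootstrap sample Z*b, drawn with replacement
        S_star[b] = S(Z[idx])

    # Bootstrap estimate of Var[S(Z)], equation (7.53).
    S_bar = S_star.mean()
    var_hat = np.sum((S_star - S_bar) ** 2) / (B - 1)
    print(var_hat)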
How can we apply the bootstrap to estimate prediction error? One approach would be to fit the model in question on a set of bootstrap samples, and then keep track of how well it predicts the original training set. If f̂*b(xi) is the predicted value at xi, from the model fitted to the bth bootstrap dataset, our estimate is

    \widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\bigl( y_i, \hat{f}^{*b}(x_i) \bigr).        (7.54)

However, it is easy to see that Êrr_boot does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples. Consider for example a 1-nearest neighbor classifier applied to a two-class classification problem with the same number of observations in each class, in which the predictors and class labels are in fact independent. Then the true error rate is 0.5.
But the contributions to the bootstrap estimate Êrr_boot will be zero unless the observation i does not appear in the bootstrap sample b. In this latter case it will have the correct expectation 0.5. Now

    \Pr\{\text{observation } i \in \text{bootstrap sample } b\} = 1 - \Bigl( 1 - \frac{1}{N} \Bigr)^{N} \approx 1 - e^{-1} = 0.632.        (7.55)

Hence the expectation of Êrr_boot is about 0.5 × 0.368 = 0.184, far below the correct error rate 0.5.

By mimicking cross-validation, a better bootstrap estimate can be obtained. For each observation, we only keep track of predictions from bootstrap samples not containing that observation. The leave-one-out bootstrap estimate of prediction error is defined by

    \widehat{\mathrm{Err}}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L\bigl( y_i, \hat{f}^{*b}(x_i) \bigr).        (7.56)

Here C^{-i} is the set of indices of the bootstrap samples b that do not contain observation i, and |C^{-i}| is the number of such samples. In computing Êrr^(1), we either have to choose B large enough to ensure that all of the |C^{-i}| are greater than zero, or we can just leave out the terms in (7.56) corresponding to |C^{-i}|'s that are zero.
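To see the difference numerically, here is a minimal simulation sketch (Python/NumPy, with a hand-rolled 1-nearest-neighbor rule; N, B and the single-feature setup are illustrative choices, not taken from the text) that computes both the naive estimate (7.54) and the leave-one-out estimate (7.56) when the class labels are independent of the input.

    import numpy as np

    rng = np.random.default_rng(1)
    N, B = 100, 200

    # Two classes with equal probability, labels independent of the single input,
    # so the true error rate of any rule is 0.5.
    x = rng.standard_normal(N)
    y = rng.integers(0, 2, size=N)

    def one_nn_predict(x_train, y_train, x_test):
        # 1-nearest-neighbor prediction for each point of x_test.
        nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
        return y_train[nearest]

    loss = np.zeros((B, N))              # L(y_i, f*b(x_i)) for every pair (b, i)
    in_sample = np.zeros((B, N), bool)   # does observation i appear in sample b?

    for b in range(B):
        idx = rng.integers(0, N, size=N)           # bootstrap sample Z*b
        in_sample[b] = np.isin(np.arange(N), idx)
        preds = one_nn_predict(x[idx], y[idx], x)  # predict the ORIGINAL training set
        loss[b] = (preds != y).astype(float)       # 0-1 loss

    # Naive bootstrap estimate (7.54): average the loss over all b and i.
    err_boot = loss.mean()

    # Leave-one-out bootstrap (7.56): for each i, average only over the bootstrap
    # samples b that do not contain observation i (the set C^{-i}).
    out = ~in_sample
    err_loo = np.mean([loss[out[:, i], i].mean()
                       for i in range(N) if out[:, i].any()])

    print(err_boot)  # close to 0.5 * 0.368 = 0.184
    print(err_loo)   # close to the true error rate 0.5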
The leave-one-out bootstrap solves the overfitting problem suffered by Êrr_boot, but has the training-set-size bias mentioned in the discussion of cross-validation. The average number of distinct observations in each bootstrap sample is about 0.632 · N, so its bias will roughly behave like that of twofold cross-validation.
Thus if the learning curve has considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the true error. The ".632 estimator" is designed to alleviate this bias. It is defined by

    \widehat{\mathrm{Err}}^{(.632)} = .368 \cdot \overline{\mathrm{err}} + .632 \cdot \widehat{\mathrm{Err}}^{(1)}.        (7.57)

The derivation of the .632 estimator is complex; intuitively it pulls the leave-one-out bootstrap estimate down toward the training error rate, and hence reduces its upward bias. The use of the constant .632 relates to (7.55). The .632 estimator works well in "light fitting" situations, but can break down in overfit ones.
Here is an example due to Breiman et al. (1984). Suppose we have two equal-size classes, with the targets independent of the class labels, and we apply a one-nearest neighbor rule. Then err = 0, Êrr^(1) = 0.5 and so Êrr^(.632) = .632 × 0.5 = .316. However, the true error rate is 0.5.

One can improve the .632 estimator by taking into account the amount of overfitting. First we define γ to be the no-information error rate: this is the error rate of our prediction rule if the inputs and class labels were independent.
An estimate of γ is obtained by evaluating the prediction rule on all possible combinations of targets yi and predictors xi′:

    \hat{\gamma} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} L\bigl( y_i, \hat{f}(x_{i'}) \bigr).        (7.58)

For example, consider the dichotomous classification problem: let p̂1 be the observed proportion of responses yi equaling 1, and let q̂1 be the observed proportion of predictions f̂(xi′) equaling 1. Then

    \hat{\gamma} = \hat{p}_1 (1 - \hat{q}_1) + (1 - \hat{p}_1)\, \hat{q}_1.        (7.59)

With a rule like 1-nearest neighbors for which q̂1 = p̂1, the value of γ̂ is 2p̂1(1 − p̂1). The multi-category generalization of (7.59) is γ̂ = Σ_ℓ p̂ℓ(1 − q̂ℓ). Using this, the relative overfitting rate is defined to be

    \hat{R} = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}},        (7.60)

a quantity that ranges from 0 if there is no overfitting (Êrr^(1) = err) to 1 if the overfitting equals the no-information value γ̂ − err. Finally, we define the ".632+" estimator by

    \widehat{\mathrm{Err}}^{(.632+)} = (1 - \hat{w}) \cdot \overline{\mathrm{err}} + \hat{w} \cdot \widehat{\mathrm{Err}}^{(1)}, \qquad \text{with } \hat{w} = \frac{.632}{1 - .368\, \hat{R}}.        (7.61)

The weight ŵ ranges from .632 if R̂ = 0 to 1 if R̂ = 1, so Êrr^(.632+) ranges from Êrr^(.632) to Êrr^(1).
Again, the derivation of (7.61) is complicated: roughly speaking, it produces a compromise between the leave-one-out bootstrap and the training error rate that depends on the amount of overfitting. For the 1-nearest-neighbor problem with class labels independent of the inputs, ŵ = R̂ = 1, so Êrr^(.632+) = Êrr^(1), which has the correct expectation of 0.5. In other problems with less overfitting, Êrr^(.632+) will lie somewhere between err and Êrr^(1).

7.11.1 Example (Continued)

Figure 7.13 shows the results of tenfold cross-validation and the .632+ bootstrap estimate in the same four problems of Figure 7.7. As in that figure,
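The pieces (7.57)–(7.61) can be assembled as in the sketch below for 0-1 loss. This is an illustrative sketch, not the authors' code: the function name is made up, Êrr^(1) is assumed to be supplied from a computation such as (7.56), and the clipping of R̂ to [0, 1] is a practical safeguard added here rather than something stated in the text.

    import numpy as np

    def err632plus(y, yhat_train, err1):
        # ".632+" estimate (7.61) for 0-1 loss.
        #   y          : true class labels of the training set
        #   yhat_train : predictions f(x_i) of the rule fit to the full training set
        #   err1       : leave-one-out bootstrap estimate from (7.56)
        y = np.asarray(y)
        yhat_train = np.asarray(yhat_train)

        err_bar = np.mean(y != yhat_train)  # training error

        # No-information error rate (7.58): evaluate the rule on all
        # combinations of targets y_i and predictions f(x_{i'}).
        gamma = np.mean(y[:, None] != yhat_train[None, :])

        # Relative overfitting rate (7.60), clipped to [0, 1] for safety
        # (an added safeguard, not part of the formula in the text).
        denom = gamma - err_bar
        R = 0.0 if denom <= 0 else min(1.0, max(0.0, (err1 - err_bar) / denom))

        w = 0.632 / (1.0 - 0.368 * R)       # weight in (7.61)
        return (1 - w) * err_bar + w * err1

    # Breiman et al. (1984) example from the text: 1-NN with labels independent
    # of the inputs gives err = 0, Err^(1) = 0.5, gamma = 0.5, hence R = w = 1,
    # and the .632+ estimate is 0.5 (whereas the .632 estimate (7.57) is .316).
    y_demo = np.array([0, 1] * 50)
    yhat_demo = y_demo.copy()               # 1-NN reproduces the training labels
    print(err632plus(y_demo, yhat_demo, err1=0.5))  # 0.5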
Figure 7.13 shows boxplots of 100 · [Err_{α̂} − min_α Err(α)]/[max_α Err(α) − min_α Err(α)], the error in using the chosen model relative to the best model. There are 100 different training sets represented in each boxplot.

[FIGURE 7.13. Boxplots show the distribution of the relative error 100 · [Err_{α̂} − min_α Err(α)]/[max_α Err(α) − min_α Err(α)] over the four scenarios of Figure 7.3. This is the error in using the chosen model relative to the best model. There are 100 training sets represented in each boxplot. Two panels, Cross-validation and Bootstrap, each showing the scenarios reg/KNN, reg/linear, class/KNN and class/linear; vertical axis: % Increase Over Best.]
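For completeness, the relative-error measure plotted in these boxplots is straightforward to compute; the sketch below is illustrative only (the function name and the numbers in the example are hypothetical, and in the simulations the true Err(α) values would come from knowledge of the data-generating process or a large test set).

    import numpy as np

    def relative_error(err_true, alpha_hat_idx):
        # 100 * [Err(alpha_hat) - min_a Err(a)] / [max_a Err(a) - min_a Err(a)]
        #   err_true      : true test error Err(a) over the grid of tuning values a
        #   alpha_hat_idx : index of the tuning value chosen by CV, bootstrap, AIC, ...
        err_true = np.asarray(err_true, float)
        lo, hi = err_true.min(), err_true.max()
        return 100.0 * (err_true[alpha_hat_idx] - lo) / (hi - lo)

    # Hypothetical illustration: true errors over 5 model sizes, CV picks index 1.
    print(relative_error([0.40, 0.32, 0.30, 0.33, 0.38], alpha_hat_idx=1))  # 20.0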
Both measures perform well overall, perhaps the same or slightly worse than the AIC in Figure 7.7.

Our conclusion is that for these particular problems and fitting methods, minimization of either AIC, cross-validation or bootstrap yields a model fairly close to the best available. Note that for the purpose of model selection, any of the measures could be biased and it wouldn't affect things, as long as the bias did not change the relative performance of the methods. For example, the addition of a constant to any of the measures would not change the resulting chosen model. However, for many adaptive, nonlinear techniques (like trees), estimation of the effective number of parameters is very difficult. This makes methods like AIC impractical and leaves us with cross-validation or bootstrap as the methods of choice.

A different question is: how well does each method estimate test error? On the average the AIC criterion overestimated prediction error of its chosen model by 38%, 37%, 51%, and 30%, respectively, over the four scenarios, with BIC performing similarly.