In contrast, cross-validation overestimated the error by 1%, 4%, 0%, and 4%, with the bootstrap doing about the same. Hence the extra work involved in computing a cross-validation or bootstrap measure is worthwhile, if an accurate estimate of test error is required. With other fitting methods like trees, cross-validation and the bootstrap can underestimate the true error by 10%, because the search for the best tree is strongly affected by the validation set. In these situations only a separate test set will provide an unbiased estimate of test error.

7.12 Conditional or Expected Test Error?

Figures 7.14 and 7.15 examine the question of whether cross-validation does a good job in estimating ErrT, the error conditional on a given training set T (expression (7.15) on page 228), as opposed to the expected test error. For each of 100 training sets generated from the "reg/linear" setting in the top-right panel of Figure 7.3, Figure 7.14 shows the conditional error curves ErrT as a function of subset size (top left).
The next two panels show 10-fold and N-fold cross-validation, the latter also known as leave-one-out (LOO). The thick red curve in each plot is the expected error Err, while the thick black curves are the expected cross-validation curves.
The lower right panel shows how well cross-validation approximates the conditional and expected error.

One might have expected N-fold CV to approximate ErrT well, since it almost uses the full training sample to fit a new test point. 10-fold CV, on the other hand, might be expected to estimate Err well, since it averages over somewhat different training sets. From the figure it appears 10-fold does a better job than N-fold in estimating ErrT, and estimates Err even better. Indeed, the similarity of the two black curves with the red curve suggests both CV curves are approximately unbiased for Err, with 10-fold having less variance.
Similar trends were reported by Efron (1983).

Figure 7.15 shows scatterplots of both 10-fold and N-fold cross-validation error estimates versus the true conditional error for the 100 simulations. Although the scatterplots do not indicate much correlation, the lower right panel shows that for the most part the correlations are negative, a curious phenomenon that has been observed before. This negative correlation explains why neither form of CV estimates ErrT well.
The broken lines in each plot are drawn at Err(p), the expected error for the best subset of size p. We see again that both forms of CV are approximately unbiased for expected error, but the variation in test error for different training sets is quite substantial.

Among the four experimental conditions in Figure 7.3, this "reg/linear" scenario showed the highest correlation between actual and predicted test error. This phenomenon also occurs for bootstrap estimates of error, and we would guess, for any other estimate of conditional prediction error.

We conclude that estimation of test error for a particular training set is not easy in general, given just the data from that same training set. Instead, cross-validation and related methods may provide reasonable estimates of the expected error Err.
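To make the comparison concrete, here is a minimal simulation sketch in the spirit of Figures 7.14 and 7.15, assuming a generic Gaussian linear model rather than the exact "reg/linear" design of Figure 7.3 (the sizes N, p and the noise level are illustrative choices, not the book's). For each of 100 training sets it computes the conditional error ErrT on a large external test set, 10-fold and leave-one-out CV estimates, and then the mean absolute deviations and the correlation with ErrT.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 80, 20, 1.0              # illustrative sizes, not the book's design
beta = rng.normal(size=p)              # true coefficients (assumed)

def gen(n):
    X = rng.normal(size=(n, p))
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y

def ls_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cv_error(X, y, K):
    """K-fold cross-validation estimate of squared-error prediction error."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    errs = []
    for idx in folds:
        train = np.ones(n, dtype=bool)
        train[idx] = False
        b = ls_fit(X[train], y[train])
        errs.append(np.mean((y[idx] - X[idx] @ b) ** 2))
    return float(np.mean(errs))

Xtest, ytest = gen(20_000)             # large external test set, to approximate ErrT
errT, cv10, cvN = [], [], []
for _ in range(100):                   # 100 training sets, as in the figures
    X, y = gen(N)
    b = ls_fit(X, y)
    errT.append(np.mean((ytest - Xtest @ b) ** 2))
    cv10.append(cv_error(X, y, 10))
    cvN.append(cv_error(X, y, N))      # leave-one-out

errT, cv10, cvN = map(np.array, (errT, cv10, cvN))
Err = errT.mean()                      # Monte Carlo proxy for the expected error Err
print("mean |CV10 - ErrT| :", np.mean(np.abs(cv10 - errT)))
print("mean |CVN  - ErrT| :", np.mean(np.abs(cvN - errT)))
print("mean |CV10 - Err | :", np.mean(np.abs(cv10 - Err)))
print("corr(CV10, ErrT)   :", np.corrcoef(cv10, errT)[0, 1])
```

The exact numbers depend on the assumed design; the qualitative pattern the figures illustrate is that both CV curves track the expected error Err more faithfully than they track the conditional error ErrT of any particular training set.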
FIGURE 7.14. Conditional prediction-error ErrT, 10-fold cross-validation, and leave-one-out cross-validation curves for 100 simulations from the top-right panel in Figure 7.3. The thick red curve is the expected prediction error Err, while the thick black curves are the expected CV curves E_T CV_10 and E_T CV_N. The lower-right panel shows the mean absolute deviation of the CV curves from the conditional error, E_T|CV_K − ErrT| for K = 10 (blue) and K = N (green), as well as from the expected error E_T|CV_10 − Err| (orange).
FIGURE 7.15. Plots of the CV estimates of error versus the true conditional error for each of the 100 training sets, for the simulation setup in the top right panel of Figure 7.3. Both 10-fold and leave-one-out CV are depicted in different colors. The first three panels correspond to different subset sizes p, and vertical and horizontal lines are drawn at Err(p). Although there appears to be little correlation in these plots, we see in the lower right panel that for the most part the correlation is negative.

Bibliographic Notes

Key references for cross-validation are Stone (1974), Stone (1977) and Allen (1974).
The AIC was proposed by Akaike (1973), while the BIC was introduced by Schwarz (1978). Madigan and Raftery (1994) give an overview of Bayesian model selection. The MDL criterion is due to Rissanen (1983). Cover and Thomas (1991) contains a good description of coding theory and complexity. VC dimension is described in Vapnik (1996). Stone (1977) showed that the AIC and leave-one-out cross-validation are asymptotically equivalent.
Generalized cross-validation is described by Golub et al. (1979) and Wahba (1980); a further discussion of the topic may be found in the monograph by Wahba (1990). See also Hastie and Tibshirani (1990), Chapter 3. The bootstrap is due to Efron (1979); see Efron and Tibshirani (1993) for an overview. Efron (1983) proposes a number of bootstrap estimates of prediction error, including the optimism and .632 estimates. Efron (1986) compares CV, GCV and bootstrap estimates of error rates. The use of cross-validation and the bootstrap for model selection is studied by Breiman and Spector (1992), Breiman (1992), Shao (1996), Zhang (1993) and Kohavi (1995).
The .632+ estimator was proposed by Efron and Tibshirani (1997).

Cherkassky and Ma (2003) published a study on the performance of SRM for model selection in regression, in response to our study of Section 7.9.1. They complained that we had been unfair to SRM because we had not applied it properly. Our response can be found in the same issue of the journal (Hastie et al. (2003)).

Exercises
Ex. 7.1 Derive the estimate of in-sample error (7.24).

Ex. 7.2 For 0–1 loss with Y ∈ {0, 1} and Pr(Y = 1|x0) = f(x0), show that
\[
\mathrm{Err}(x_0) = \Pr(Y \ne \hat G(x_0)\,|\,X = x_0)
= \mathrm{Err}_B(x_0) + |2f(x_0) - 1|\,\Pr(\hat G(x_0) \ne G(x_0)\,|\,X = x_0), \tag{7.62}
\]
where \(\hat G(x) = I(\hat f(x) > \tfrac{1}{2})\), \(G(x) = I(f(x) > \tfrac{1}{2})\) is the Bayes classifier, and \(\mathrm{Err}_B(x_0) = \Pr(Y \ne G(x_0)\,|\,X = x_0)\), the irreducible Bayes error at x0.
Using the approximation \(\hat f(x_0) \sim N(\mathrm{E}\hat f(x_0), \mathrm{Var}(\hat f(x_0)))\), show that
\[
\Pr(\hat G(x_0) \ne G(x_0)\,|\,X = x_0) \approx
\Phi\!\left( \frac{\mathrm{sign}\big(\tfrac{1}{2} - f(x_0)\big)\big(\mathrm{E}\hat f(x_0) - \tfrac{1}{2}\big)}{\sqrt{\mathrm{Var}(\hat f(x_0))}} \right). \tag{7.63}
\]
In the above,
\[
\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} \exp(-u^2/2)\,du,
\]
the cumulative Gaussian distribution function. This is an increasing function, with value 0 at t = −∞ and value 1 at t = +∞.

We can think of sign(1/2 − f(x0))(Ef̂(x0) − 1/2) as a kind of boundary bias term, as it depends on the true f(x0) only through which side of the boundary (1/2) it lies on.
Notice also that the bias and variance combine in a multiplicative rather than additive fashion. If Ef̂(x0) is on the same side of 1/2 as f(x0), then the bias is negative, and decreasing the variance will decrease the misclassification error. On the other hand, if Ef̂(x0) is on the opposite side of 1/2 to f(x0), then the bias is positive and it pays to increase the variance! Such an increase will improve the chance that f̂(x0) falls on the correct side of 1/2 (Friedman, 1997).
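As a sanity check on (7.62) and (7.63), and not a substitute for the derivation the exercise asks for, the following Monte Carlo sketch fixes a point x0 and assumes f̂(x0) is Gaussian; the values chosen for f(x0), Ef̂(x0) and the standard deviation are arbitrary illustrative numbers.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
f_x0 = 0.7                        # assumed true Pr(Y = 1 | x0)
mu_hat, sd_hat = 0.45, 0.15       # assumed E f_hat(x0) and sd(f_hat(x0))

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

M = 200_000
f_hat = rng.normal(mu_hat, sd_hat, size=M)     # draws of f_hat(x0)
Y = (rng.random(M) < f_x0).astype(int)         # draws of Y at x0, independent of f_hat
G_hat = (f_hat > 0.5).astype(int)              # estimated classifier at x0
G = int(f_x0 > 0.5)                            # Bayes classifier at x0

err_x0 = np.mean(Y != G_hat)                   # Err(x0), simulated
err_B = min(f_x0, 1.0 - f_x0)                  # Bayes error Pr(Y != G(x0) | X = x0)
p_flip = np.mean(G_hat != G)                   # Pr(G_hat(x0) != G(x0)), simulated
rhs_62 = err_B + abs(2.0 * f_x0 - 1.0) * p_flip
approx_63 = Phi(np.sign(0.5 - f_x0) * (mu_hat - 0.5) / sd_hat)

print("Err(x0) simulated        :", err_x0)
print("(7.62) right-hand side   :", rhs_62)
print("Pr(Ghat != G), simulated :", p_flip)
print("(7.63) normal approx     :", approx_63)
```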
Ex. 7.3 Let f̂ = Sy be a linear smoothing of y.

(a) If Sii is the ith diagonal element of S, show that for S arising from least squares projections and cubic smoothing splines, the cross-validated residual can be written as
\[
y_i - \hat f^{-i}(x_i) = \frac{y_i - \hat f(x_i)}{1 - S_{ii}}. \tag{7.64}
\]

(b) Use this result to show that \(|y_i - \hat f^{-i}(x_i)| \ge |y_i - \hat f(x_i)|\).

(c) Find general conditions on any smoother S to make result (7.64) hold.

Ex. 7.4 Consider the in-sample prediction error (7.18) and the training error err in the case of squared-error loss:
\[
\mathrm{Err_{in}} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}_{Y^0}\big(Y_i^0 - \hat f(x_i)\big)^2,
\qquad
\mathrm{err} = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat f(x_i)\big)^2.
\]
Add and subtract f(xi) and Ef̂(xi) in each expression and expand. Hence establish that the average optimism in the training error is
\[
\frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i),
\]
as given in (7.21).

Ex. 7.5 For a linear smoother ŷ = Sy, show that
\[
\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = \mathrm{trace}(\mathbf{S})\,\sigma_\varepsilon^2, \tag{7.65}
\]
which justifies its use as the effective number of parameters.
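A quick numerical check of the leave-one-out identity (7.64), assuming S is the hat matrix of a least-squares projection; the script also prints trace(S), the effective number of parameters that (7.65) motivates (for a projection it equals p). The data are simulated, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 5
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)

S = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix of the least-squares projection
f_hat = S @ y
S_ii = np.diag(S)

# Direct leave-one-out residuals, for comparison with the shortcut (7.64)
loo = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo[i] = y[i] - X[i] @ b

shortcut = (y - f_hat) / (1.0 - S_ii)     # right-hand side of (7.64)
print("max |direct LOO residual - shortcut| :", np.max(np.abs(loo - shortcut)))
print("trace(S) =", np.trace(S), "  (equals p =", p, "for a projection)")
```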
Ex. 7.6 Show that for an additive-error model, the effective degrees-of-freedom for the k-nearest-neighbors regression fit is N/k.

Ex. 7.7 Use the approximation 1/(1 − x)² ≈ 1 + 2x to expose the relationship between Cp/AIC (7.26) and GCV (7.52), the main difference being the model used to estimate the noise variance σε².

Ex. 7.8 Show that the set of functions {I(sin(αx) > 0)} can shatter the following points on the line:
\[
z^1 = 10^{-1}, \;\ldots,\; z^\ell = 10^{-\ell}, \tag{7.66}
\]
for any ℓ. Hence the VC dimension of the class {I(sin(αx) > 0)} is infinite.
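The shattering claim of Ex. 7.8 can also be verified numerically for a small ℓ. The sketch below uses one well-known choice of frequency, α = π(1 + Σᵢ (1 − yᵢ)10ⁱ), to realize an arbitrary labeling y ∈ {0, 1}^ℓ of the points (7.66); this particular construction is our own illustration of the shattering argument, not something specified in the exercise.

```python
import numpy as np
from itertools import product

ell = 6
z = 10.0 ** -np.arange(1, ell + 1)        # the points z^i = 10^{-i} of (7.66)

def alpha_for(labels):
    """Frequency realizing the 0/1 labeling: alpha = pi * (1 + sum_i (1 - y_i) 10^i)."""
    i = np.arange(1, ell + 1)
    return np.pi * (1.0 + np.sum((1 - np.asarray(labels)) * 10.0 ** i))

all_ok = True
for labels in product([0, 1], repeat=ell):     # every one of the 2^ell labelings
    a = alpha_for(labels)
    realized = (np.sin(a * z) > 0).astype(int)
    all_ok &= np.array_equal(realized, np.array(labels))

print("all", 2 ** ell, "labelings realized:", all_ok)
```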
Ex. 7.9 For the prostate data of Chapter 3, carry out a best-subset linear regression analysis, as in Table 3.3 (third column from left). Compute the AIC, BIC, five- and tenfold cross-validation, and bootstrap .632 estimates of prediction error. Discuss the results.

Ex. 7.10 Referring to the example in Section 7.10.3, suppose instead that all of the p predictors are binary, and hence there is no need to estimate split points. The predictors are independent of the class labels as before. Then if p is very large, we can probably find a predictor that splits the entire training data perfectly, and hence would split the validation data (one-fifth of data) perfectly as well. This predictor would therefore have zero cross-validation error.
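A small simulation of the premise of Ex. 7.10, assuming N = 12 balanced observations and p independent Bernoulli(1/2) predictors that carry no signal (these sizes are illustrative, not those of Section 7.10.3); it counts how often some predictor separates the two classes perfectly as p grows.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 12
y = np.array([0, 1] * (N // 2))                    # balanced labels, independent of X

for p in (100, 2_000, 20_000):
    reps, hits = 200, 0
    for _ in range(reps):
        X = rng.integers(0, 2, size=(N, p))        # binary predictors with no signal
        # a column "splits perfectly" if it equals y (or 1 - y) on every observation
        perfect = np.any(np.all(X == y[:, None], axis=0) |
                         np.all(X == 1 - y[:, None], axis=0))
        hits += int(perfect)
    print(f"p = {p:>6}: perfect split found in {hits}/{reps} training sets")
```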