The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Ex. 6.10 Suppose we have N samples generated from the model y_i = f(x_i) + ε_i, with the ε_i independent and identically distributed with mean zero and variance σ², the x_i assumed fixed (non-random). We estimate f using a linear smoother (local regression, smoothing spline, etc.) with smoothing parameter λ. Thus the vector of fitted values is given by f̂ = S_λ y. Consider the in-sample prediction error

$$ \mathrm{PE}(\lambda) = E\,\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i^{*} - \hat f_\lambda(x_i)\bigr)^{2} \qquad (6.34) $$

for predicting new responses at the N input values. Show that the average squared residual on the training data, ASR(λ), is a biased estimate (optimistic) for PE(λ), while

$$ C_\lambda = \mathrm{ASR}(\lambda) + \frac{2\sigma^{2}}{N}\,\mathrm{trace}(\mathbf{S}_\lambda) \qquad (6.35) $$

is unbiased.

Ex. 6.11 Show that for the Gaussian mixture model (6.32) the likelihood is maximized at +∞, and describe how.

Ex. 6.12 Write a computer program to perform a local linear discriminant analysis. At each query point x_0, the training data receive weights K_λ(x_0, x_i) from a weighting kernel, and the ingredients for the linear decision boundaries (see Section 4.3) are computed by weighted averages. Try out your program on the zipcode data, and show the training and test errors for a series of five pre-chosen values of λ. The zipcode data are available from the book website www-stat.stanford.edu/ElemStatLearn.
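The following is a small numerical sketch of the quantities in Ex. 6.10. It is only an illustration under assumed settings: a ridge-type linear smoother stands in for the generic S_λ, and the values of N, p, σ and λ are arbitrary choices, none of which come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: a ridge-type linear smoother f_hat = S y with
# S = X (X^T X + lam I)^{-1} X^T, standing in for the generic S_lambda.
N, p, sigma, lam = 100, 10, 1.0, 5.0
X = rng.normal(size=(N, p))                 # fixed inputs x_i
f_true = X @ rng.normal(size=p)             # underlying f(x_i)
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

asr_vals, c_vals, pe_vals = [], [], []
for _ in range(2000):
    y = f_true + sigma * rng.normal(size=N)       # training responses
    y_star = f_true + sigma * rng.normal(size=N)  # new responses at the same x_i
    f_hat = S @ y
    asr = np.mean((y - f_hat) ** 2)               # average squared residual
    asr_vals.append(asr)
    c_vals.append(asr + 2 * sigma**2 * np.trace(S) / N)   # C_lambda of (6.35)
    pe_vals.append(np.mean((y_star - f_hat) ** 2))        # in-sample PE of (6.34)

print(f"mean ASR      = {np.mean(asr_vals):.3f}   (optimistic)")
print(f"mean C_lambda = {np.mean(c_vals):.3f}")
print(f"PE(lambda)    = {np.mean(pe_vals):.3f}")
```

Averaged over many training samples, ASR falls below the Monte Carlo estimate of PE(λ) by roughly 2σ² trace(S_λ)/N, while C_λ tracks it closely.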
7 Model Assessment and Selection

7.1 Introduction

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

In this chapter we describe and illustrate the key methods for performance assessment, and show how they are used to select models. We begin the chapter with a discussion of the interplay between bias, variance and model complexity.

7.2 Bias, Variance and Model Complexity

Figure 7.1 illustrates the important issue in assessing the ability of a learning method to generalize.
Consider first the case of a quantitative or interval scale response. We have a target variable Y, a vector of inputs X, and a prediction model f̂(X) that has been estimated from a training set T. The loss function for measuring errors between Y and f̂(X) is denoted by L(Y, f̂(X)). Typical choices are

$$ L(Y, \hat f(X)) = \begin{cases} (Y - \hat f(X))^{2} & \text{squared error} \\ |Y - \hat f(X)| & \text{absolute error.} \end{cases} \qquad (7.1) $$

[FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error err, while the light red curves show the conditional test error Err_T for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error Err and the expected training error E[err]. Axes: Prediction Error versus Model Complexity (df), ranging from high bias/low variance on the left to low bias/high variance on the right.]

Test error, also referred to as generalization error, is the prediction error over an independent test sample,

$$ \mathrm{Err}_{T} = E[L(Y, \hat f(X)) \mid T], \qquad (7.2) $$

where both X and Y are drawn randomly from their joint distribution (population). Here the training set T is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error (or expected test error),

$$ \mathrm{Err} = E[L(Y, \hat f(X))] = E[\mathrm{Err}_{T}]. \qquad (7.3) $$

Note that this expectation averages over everything that is random, including the randomness in the training set that produced f̂.

Figure 7.1 shows the prediction error (light red curves) Err_T for 100 simulated training sets each of size 50.
The lasso (Section 3.4.2) was used to produce the sequence of fits. The solid red curve is the average, and hence an estimate of Err.
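To make the distinction between Err_T (conditional on one training set) and Err (averaged over training sets) concrete, here is a minimal simulation sketch. The linear data-generating model, the least-squares fit and all sample sizes are illustrative assumptions; it is not the lasso simulation behind Figure 7.1.

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma = 20, 2.0
beta = rng.normal(size=p)              # "true" coefficients (illustrative)

def make_data(n):
    """Draw (X, y) from the assumed linear model y = X beta + noise."""
    X = rng.normal(size=(n, p))
    return X, X @ beta + sigma * rng.normal(size=n)

def fit_ls(X, y):
    """Ordinary least squares, standing in for a generic fitted model f_hat."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def conditional_test_error(coef, n_test=20_000):
    """Squared-error loss on a large independent sample: estimates Err_T of (7.2)."""
    X_te, y_te = make_data(n_test)
    return np.mean((y_te - X_te @ coef) ** 2)

# Err_T for one particular training set of size 50 ...
X_tr, y_tr = make_data(50)
print("Err_T for this training set:", round(conditional_test_error(fit_ls(X_tr, y_tr)), 3))

# ... and an estimate of Err = E[Err_T] of (7.3), averaging over
# 100 training sets of size 50, echoing the setup of Figure 7.1.
errs = [conditional_test_error(fit_ls(*make_data(50))) for _ in range(100)]
print("Err, averaged over training sets:", round(float(np.mean(errs)), 3))
```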
Estimation of Err_T will be our goal, although we will see that Err is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set. Some discussion of this point is given in Section 7.12.

Training error is the average loss over the training sample,

$$ \mathrm{err} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i)). \qquad (7.4) $$

We would like to know the expected test error of our estimated model f̂.
As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. There is some intermediate model complexity that gives minimum expected test error.

Unfortunately training error is not a good estimate of the test error, as seen in Figure 7.1. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly.
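A short sketch of this behavior under assumed settings: polynomial degree plays the role of model complexity here (not the lasso sequence of Figure 7.1), and the target function and noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n, sigma=0.3):
    """Noisy observations of a smooth target on [-1, 1] (illustrative)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(np.pi * x) + sigma * rng.normal(size=n)

x_tr, y_tr = sample(50)        # small training set
x_te, y_te = sample(5000)      # large independent test set

# Training error keeps falling as the degree (complexity) grows, while
# test error eventually rises again: the U-shape of Figure 7.1.
for degree in (1, 3, 6, 12, 18):
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_err = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    test_err = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}:  train {train_err:.3f}   test {test_err:.3f}")
```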
The story is similar for a qualitative or categorical response G taking one of K values in a set G, labeled for convenience as 1, 2, ..., K. Typically we model the probabilities p_k(X) = Pr(G = k | X) (or some monotone transformations f_k(X)), and then Ĝ(X) = arg max_k p̂_k(X). In some cases, such as 1-nearest neighbor classification (Chapters 2 and 13), we produce Ĝ(X) directly.
Typical loss functions are

$$ L(G, \hat G(X)) = I(G \neq \hat G(X)) \qquad \text{(0–1 loss)}, \qquad (7.5) $$

$$ L(G, \hat p(X)) = -2\sum_{k=1}^{K} I(G = k)\,\log \hat p_k(X) = -2\log \hat p_G(X) \qquad (-2\times\text{log-likelihood}). \qquad (7.6) $$

The quantity −2 × the log-likelihood is sometimes referred to as the deviance.

Again, test error here is Err_T = E[L(G, Ĝ(X)) | T], the population misclassification error of the classifier trained on T, and Err is the expected misclassification error.

Training error is the sample analogue, for example,

$$ \mathrm{err} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat p_{g_i}(x_i), \qquad (7.7) $$

the sample log-likelihood for the model.

The log-likelihood can be used as a loss function for general response densities, such as the Poisson, gamma, exponential, log-normal and others. If Pr_{θ(X)}(Y) is the density of Y, indexed by a parameter θ(X) that depends on the predictor X, then

$$ L(Y, \theta(X)) = -2\cdot \log \Pr\nolimits_{\theta(X)}(Y). \qquad (7.8) $$

The "−2" in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss.
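The two classification losses are simple to compute; the sketch below evaluates (7.5) and (7.6) for one observation with hypothetical predicted probabilities (the numbers are made up purely for illustration).

```python
import numpy as np

def zero_one_loss(g, g_hat):
    """0-1 loss of (7.5): 1 if the predicted class is wrong, else 0."""
    return float(g != g_hat)

def deviance_loss(g, p_hat):
    """-2 x log-likelihood loss of (7.6): -2 log p_hat_G(X)."""
    return -2.0 * np.log(p_hat[g])

# Hypothetical 3-class case: the true class is 1 (0-based indexing),
# and p_hat holds the model's predicted probabilities at X.
p_hat = np.array([0.2, 0.7, 0.1])
g = 1
g_hat = int(np.argmax(p_hat))                  # G_hat(X) = argmax_k p_hat_k(X)

print("0-1 loss:     ", zero_one_loss(g, g_hat))            # 0.0 (correct class)
print("deviance loss:", round(deviance_loss(g, p_hat), 3))  # -2 log 0.7 ~= 0.713

# The training error of (7.7) is just the average of deviance_loss
# over the training sample.
```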
For ease of exposition, for the remainder of this chapter we will use Y and f(X) to represent all of the above situations, since we focus mainly on the quantitative response (squared-error loss) setting. For the other situations, the appropriate translations are obvious.

In this chapter we describe a number of methods for estimating the expected test error for a model. Typically our model will have a tuning parameter or parameters α, and so we can write our predictions as f̂_α(x). The tuning parameter varies the complexity of our model, and we wish to find the value of α that minimizes error, that is, produces the minimum of the average test error curve in Figure 7.1. Having said this, for brevity we will often suppress the dependence of f̂(x) on α.

It is important to note that there are in fact two separate goals that we might have in mind:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.
The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a "vault," and be brought out only at the end of the data analysis. Suppose instead that we use the test set repeatedly, choosing the model with smallest test-set error. Then the test-set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size.
A typical split might be 50% for training, and 25% each for validation and testing:

    Train | Validation | Test

The methods in this chapter are designed for situations where there is insufficient data to split it into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the signal-to-noise ratio of the underlying function, and the complexity of the models being fit to the data.

The methods of this chapter approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap).
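As a concrete sketch of the three-way split described in this section (one reasonable recipe under assumed proportions, not a prescription from the text):

```python
import numpy as np

def train_val_test_split(n, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly partition n sample indices into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))   # 500 250 250

# Fit the candidate models on train_idx, compare them on val_idx, and
# touch test_idx only once, to report the error of the final chosen model.
```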