Besides their use in model selection, we also examine to what extent each method provides a reliable estimate of test error of the final chosen model.

Before jumping into these topics, we first explore in more detail the nature of test error and the bias–variance tradeoff.

7.3 The Bias–Variance Decomposition

As in Chapter 2, if we assume that Y = f(X) + ε where E(ε) = 0 and Var(ε) = σ_ε^2, we can derive an expression for the expected prediction error of a regression fit f̂(X) at an input point X = x_0, using squared-error loss:

    \mathrm{Err}(x_0) = E\bigl[(Y - \hat{f}(x_0))^2 \mid X = x_0\bigr]
                      = \sigma_\varepsilon^2 + \bigl[E\hat{f}(x_0) - f(x_0)\bigr]^2 + E\bigl[\hat{f}(x_0) - E\hat{f}(x_0)\bigr]^2
                      = \sigma_\varepsilon^2 + \mathrm{Bias}^2\bigl(\hat{f}(x_0)\bigr) + \mathrm{Var}\bigl(\hat{f}(x_0)\bigr)
                      = \text{Irreducible Error} + \mathrm{Bias}^2 + \mathrm{Variance}.        (7.9)

The first term is the variance of the target around its true mean f(x_0), and cannot be avoided no matter how well we estimate f(x_0), unless σ_ε^2 = 0. The second term is the squared bias, the amount by which the average of our estimate differs from the true mean; the last term is the variance, the expected squared deviation of f̂(x_0) around its mean.
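To see the decomposition (7.9) numerically, here is a minimal simulation sketch (my illustration, not from the text): for a fixed set of training inputs it repeatedly redraws the training responses, refits a simple kernel smoother, and checks that the Monte Carlo prediction error at a single point x_0 matches σ_ε^2 + Bias^2 + Variance. The target function, noise level, bandwidth and choice of smoother are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # "true" regression function (arbitrary choice)
sigma = 0.3                              # noise standard deviation sigma_eps
x_train = np.linspace(0, 1, 50)          # fixed training inputs
x0 = 0.25                                # evaluation point

def fit_predict(y_train, x0, bandwidth=0.1):
    # a simple Nadaraya-Watson kernel smoother standing in for f_hat
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    return np.sum(w * y_train) / np.sum(w)

# repeatedly redraw the training responses and a test response at x0
preds, tests = [], []
for _ in range(20000):
    y_train = f(x_train) + sigma * rng.standard_normal(x_train.size)
    preds.append(fit_predict(y_train, x0))
    tests.append(f(x0) + sigma * rng.standard_normal())
preds, tests = np.array(preds), np.array(tests)

err_x0 = np.mean((tests - preds) ** 2)        # left-hand side of (7.9)
bias2 = (preds.mean() - f(x0)) ** 2           # squared bias
variance = preds.var()                        # variance of f_hat(x0)
print(err_x0, sigma ** 2 + bias2 + variance)  # the two numbers should agree closely
```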
Typically the more complex we make the model f̂, the lower the (squared) bias but the higher the variance.

For the k-nearest-neighbor regression fit, these expressions have the simple form

    \mathrm{Err}(x_0) = E\bigl[(Y - \hat{f}_k(x_0))^2 \mid X = x_0\bigr]
                      = \sigma_\varepsilon^2 + \Bigl[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Bigr]^2 + \frac{\sigma_\varepsilon^2}{k}.        (7.10)

Here we assume for simplicity that training inputs x_i are fixed, and the randomness arises from the y_i. The number of neighbors k is inversely related to the model complexity.
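A quick way to check (7.10) is to fix the training inputs, compute the bias term from the k nearest neighbors of x_0 and the variance term σ_ε^2/k in closed form, and compare against a simulation that redraws the y_i. The sketch below does this; the target function, noise level, k and x_0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2                       # arbitrary smooth target
sigma, k = 0.5, 7
x_train = rng.uniform(0, 1, 40)            # fixed training inputs
x0 = 0.6

nn = np.argsort(np.abs(x_train - x0))[:k]  # indices of the k nearest training inputs to x0

# closed-form terms of (7.10)
bias2 = (f(x0) - f(x_train[nn]).mean()) ** 2
variance = sigma ** 2 / k
print("formula:   ", sigma ** 2 + bias2 + variance)

# Monte Carlo check: redraw the y's, refit, and compare
errs = []
for _ in range(20000):
    y = f(x_train) + sigma * rng.standard_normal(x_train.size)
    fhat = y[nn].mean()                              # k-NN regression estimate at x0
    y0 = f(x0) + sigma * rng.standard_normal()       # new test response at x0
    errs.append((y0 - fhat) ** 2)
print("simulation:", np.mean(errs))
```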
For small k, the estimate f̂_k(x) can potentially adapt itself better to the underlying f(x). As we increase k, the bias (the squared difference between f(x_0) and the average of f(x) at the k-nearest neighbors) will typically increase, while the variance decreases.

For a linear model fit f̂_p(x) = x^T β̂, where the parameter vector β with p components is fit by least squares, we have

    \mathrm{Err}(x_0) = E\bigl[(Y - \hat{f}_p(x_0))^2 \mid X = x_0\bigr]
                      = \sigma_\varepsilon^2 + \bigl[f(x_0) - E\hat{f}_p(x_0)\bigr]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2.        (7.11)

Here h(x_0) = X(X^T X)^{-1} x_0, the N-vector of linear weights that produce the fit f̂_p(x_0) = x_0^T (X^T X)^{-1} X^T y, and hence Var[f̂_p(x_0)] = ‖h(x_0)‖^2 σ_ε^2. While this variance changes with x_0, its average (with x_0 taken to be each of the sample values x_i) is (p/N) σ_ε^2, and hence

    \frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N} \bigl[f(x_i) - E\hat{f}(x_i)\bigr]^2 + \frac{p}{N}\,\sigma_\varepsilon^2,        (7.12)

the in-sample error.
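The claim that ‖h(x_0)‖^2 averages to p/N over the training inputs follows because Σ_i ‖h(x_i)‖^2 = Σ_i x_i^T (X^T X)^{-1} x_i is the trace of the hat matrix, which equals p. A short numerical check (the simulated design matrix and its dimensions are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 5
X = rng.standard_normal((N, p))          # simulated training design matrix
XtX_inv = np.linalg.inv(X.T @ X)

def h(x0):
    # linear weights producing f_hat_p(x0) = x0^T (X^T X)^{-1} X^T y
    return X @ XtX_inv @ x0

# Var[f_hat_p(x0)] = ||h(x0)||^2 * sigma^2; averaged over the training inputs,
# the squared norm is trace(hat matrix)/N = p/N.
avg_sq_norm = np.mean([np.sum(h(X[i]) ** 2) for i in range(N)])
print(avg_sq_norm, p / N)                # agree up to floating-point error
```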
Here model complexity is directly related to the number of parameters p.

The test error Err(x_0) for a ridge regression fit f̂_α(x_0) is identical in form to (7.11), except the linear weights in the variance term are different: h(x_0) = X(X^T X + αI)^{-1} x_0. The bias term will also be different.

For a linear model family such as ridge regression, we can break down the bias more finely.
Let β_* denote the parameters of the best-fitting linear approximation to f:

    \beta_* = \arg\min_{\beta} E\bigl[f(X) - X^T \beta\bigr]^2.        (7.13)

Here the expectation is taken with respect to the distribution of the input variables X. Then we can write the average squared bias as

    E_{x_0}\bigl[f(x_0) - E\hat{f}_\alpha(x_0)\bigr]^2
        = E_{x_0}\bigl[f(x_0) - x_0^T \beta_*\bigr]^2 + E_{x_0}\bigl[x_0^T \beta_* - E\,x_0^T \hat{\beta}_\alpha\bigr]^2
        = \text{Ave[Model Bias]}^2 + \text{Ave[Estimation Bias]}^2.        (7.14)

The first term on the right-hand side is the average squared model bias, the error between the best-fitting linear approximation and the true function. The second term is the average squared estimation bias, the error between the average estimate E(x_0^T β̂) and the best-fitting linear approximation.

For linear models fit by ordinary least squares, the estimation bias is zero. For restricted fits, such as ridge regression, it is positive, and we trade it off with the benefits of a reduced variance.
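The decomposition (7.14) can be checked numerically: approximate β_* by least squares on a very large sample of inputs, compute E β̂_α for a ridge fit exactly as (X^T X + αI)^{-1} X^T f(X) (since E y = f(X) and the ridge estimator is linear in y), and average the two bias terms over test points. The sketch below does this; the nonlinear f, the input distribution and the penalty α are arbitrary choices, and any small gap between the printed numbers is sampling error, since the cross term vanishes exactly for the population β_*.

```python
import numpy as np

rng = np.random.default_rng(3)
p, N, alpha = 3, 60, 5.0
f = lambda X: np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2       # nonlinear truth (arbitrary)

# beta_*: best linear approximation to f under the input distribution, as in (7.13),
# approximated by least squares on a very large sample of inputs.
X_big = rng.uniform(-2, 2, (200_000, p))
beta_star, *_ = np.linalg.lstsq(X_big, f(X_big), rcond=None)

X = rng.uniform(-2, 2, (N, p))                           # fixed training inputs
ridge = np.linalg.inv(X.T @ X + alpha * np.eye(p)) @ X.T
beta_bar = ridge @ f(X)   # E[beta_hat_alpha], since E[y] = f(X) and ridge is linear in y

X_test = rng.uniform(-2, 2, (50_000, p))                 # points x0 for the outer average
model_bias2 = np.mean((f(X_test) - X_test @ beta_star) ** 2)
estimation_bias2 = np.mean((X_test @ beta_star - X_test @ beta_bar) ** 2)
total_bias2 = np.mean((f(X_test) - X_test @ beta_bar) ** 2)
print(total_bias2, model_bias2 + estimation_bias2)       # should agree closely
```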
The model bias can only be reduced by enlarging the class of linear models to a richer collection of models, by including interactions and transformations of the variables in the model.

Figure 7.2 shows the bias–variance tradeoff schematically. In the case of linear models, the model space is the set of all linear predictions from p inputs, and the black dot labeled “closest fit” is x^T β_*.
The blue-shaded region indicates the error σ_ε with which we see the truth in the training sample.

Also shown is the variance of the least squares fit, indicated by the large yellow circle centered at the black dot labeled “closest fit in population.”

[Figure 7.2 appears here: a schematic with the labels Truth, Realization, Closest fit, Closest fit in population, MODEL SPACE, Model bias, Estimation Bias, Estimation Variance, Shrunken fit and RESTRICTED MODEL SPACE.]

FIGURE 7.2. Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the “closest fit” labeled with a black dot.
The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled “closest fit in population.” A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.

Now if we were to fit a model with fewer predictors, or regularize the coefficients by shrinking them toward zero (say), we would get the “shrunken fit” shown in the figure. This fit has an additional estimation bias, due to the fact that it is not the closest fit in the model space.
On the other hand, it has smaller variance. If the decrease in variance exceeds the increase in (squared) bias, then this is worthwhile.

7.3.1 Example: Bias–Variance Tradeoff

Figure 7.3 shows the bias–variance tradeoff for two simulated examples. There are 80 observations and 20 predictors, uniformly distributed in the hypercube [0, 1]^{20}. The situations are as follows:

Left panels: Y is 0 if X_1 ≤ 1/2 and 1 if X_1 > 1/2, and we apply k-nearest neighbors.

Right panels: Y is 1 if \sum_{j=1}^{10} X_j is greater than 5 and 0 otherwise, and we use best subset linear regression of size p.

The top row is regression with squared error loss; the bottom row is classification with 0–1 loss. The figures show the prediction error (red), squared bias (green) and variance (blue), all computed for a large test sample.

In the regression problems, bias and variance add to produce the prediction error curves, with minima at about k = 5 for k-nearest neighbors, and p ≥ 10 for the linear model.
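The sketch below is a simplified reconstruction of the left-hand regression panel (my own code, not the authors'): for a grid of k it redraws the training set many times, evaluates the k-NN regression fit on a fixed test sample, and reports squared bias and variance. Because Y is a deterministic function of X_1 here, there is no irreducible error, so prediction error under squared loss is exactly their sum. The grid of k values and the simulation sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, P, n_sims, n_test = 80, 20, 200, 500
ks = [1, 3, 5, 7, 11, 15, 21, 31, 51]

X_test = rng.uniform(size=(n_test, P))
f_test = (X_test[:, 0] > 0.5).astype(float)         # true regression function E[Y|X]

preds = {k: np.zeros((n_sims, n_test)) for k in ks}
for s in range(n_sims):                             # redraw the training set each time
    X = rng.uniform(size=(N, P))
    y = (X[:, 0] > 0.5).astype(float)
    d = np.linalg.norm(X_test[:, None, :] - X[None, :, :], axis=2)  # test-to-train distances
    order = np.argsort(d, axis=1)
    for k in ks:
        preds[k][s] = y[order[:, :k]].mean(axis=1)  # k-NN regression estimate

for k in ks:
    bias2 = np.mean((preds[k].mean(axis=0) - f_test) ** 2)
    var = np.mean(preds[k].var(axis=0))
    print(f"k={k:3d}  bias^2={bias2:.3f}  variance={var:.3f}  pred. error={bias2 + var:.3f}")
```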
For classification loss (bottom figures), some interesting phenomena can be seen. The bias and variance curves are the same as in the top figures, and prediction error now refers to misclassification rate. We see that prediction error is no longer the sum of squared bias and variance. For the k-nearest neighbor classifier, prediction error decreases or stays the same as the number of neighbors is increased to 20, despite the fact that the squared bias is rising.
For the linear model classifier the minimum occurs for p ≥ 10 as in regression, but the improvement over the p = 1 model is more dramatic. We see that bias and variance seem to interact in determining prediction error.

Why does this happen? There is a simple explanation for the first phenomenon.
Suppose at a given input point, the true probability of class 1 is 0.9 while the expected value of our estimate is 0.6. Then the squared bias, (0.6 − 0.9)^2, is considerable, but the prediction error is zero since we make the correct decision. In other words, estimation errors that leave us on the right side of the decision boundary don't hurt.
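Here is a tiny numerical sketch of this point, reusing the 0.9 and 0.6 values from the text (the Gaussian spreads of the estimate are my own assumed values): as long as the estimate stays on the correct side of 1/2, the 0–1 error matches the Bayes error despite a squared bias of about 0.09; only when the variance pushes estimates across 1/2 does the error grow, which is the bias–variance interaction referred to below.

```python
import numpy as np

rng = np.random.default_rng(5)
p_true = 0.9                 # true P(class 1) at this input point (from the text)
err_bayes = 1 - p_true       # error of the best possible rule: always predict class 1

for spread in (0.02, 0.10, 0.30):           # assumed spreads, chosen for illustration
    est = rng.normal(0.6, spread, 100_000)  # probability estimates centered at 0.6, as in the text
    squared_bias = (est.mean() - p_true) ** 2   # roughly (0.6 - 0.9)^2 = 0.09, whatever the spread
    wrong_side = np.mean(est <= 0.5)            # fraction of fits that classify to class 0
    err_rule = (1 - wrong_side) * err_bayes + wrong_side * p_true  # expected 0-1 error of our rule
    print(f"spread={spread:.2f}  bias^2={squared_bias:.3f}  "
          f"excess 0-1 error={err_rule - err_bayes:.3f}")
```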
Exercise 7.2 demonstrates this phenomenon analytically, and also shows the interaction effect between bias and variance.

The overall point is that the bias–variance tradeoff behaves differently for 0–1 loss than it does for squared error loss. This in turn means that the best choices of tuning parameters may differ substantially in the two settings. One should base the choice of tuning parameter on an estimate of prediction error, as described in the following sections.

[Figure 7.3 appears here: four panels of error curves, titled k-NN Regression, Linear Model Regression, k-NN Classification and Linear Model Classification; the horizontal axes are Number of Neighbors k (left column) and Subset Size p (right column).]

FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classification with 0–1 loss. The models are k-nearest neighbors (left) and best subset regression of size p (right). The variance and bias curves are the same in regression and classification, but the prediction error curve is different.

7.4 Optimism of the Training Error Rate

Discussions of error rate estimation can be confusing, because we have to make clear which quantities are fixed and which are random. Before we continue, we need a few definitions, elaborating on the material of Section 7.2.
Given a training set T = {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)}, the generalization error of a model f̂ is

    \mathrm{Err}_T = E_{X^0, Y^0}\bigl[L(Y^0, \hat{f}(X^0)) \mid T\bigr].        (7.15)

Note that the training set T is fixed in expression (7.15). The point (X^0, Y^0) is a new test data point, drawn from F, the joint distribution of the data. Averaging over training sets T yields the expected error

    \mathrm{Err} = E_T\, E_{X^0, Y^0}\bigl[L(Y^0, \hat{f}(X^0)) \mid T\bigr],        (7.16)

which is more amenable to statistical analysis.
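To make the distinction between Err_T and Err concrete, here is a simulation sketch; the data-generating model and the polynomial least-squares fit are arbitrary stand-ins for f and f̂. It estimates Err_T for one fixed training set by averaging the squared-error loss over many new test points, and then averages that quantity over independent training sets to approximate Err.

```python
import numpy as np

rng = np.random.default_rng(6)
N, sigma = 30, 1.0
f = lambda x: np.sin(3 * x)                  # arbitrary true function

def draw_training_set():
    x = rng.uniform(-1, 1, N)
    return x, f(x) + sigma * rng.standard_normal(N)

def fit(x, y, degree=5):
    return np.polyfit(x, y, degree)          # polynomial least squares stands in for f_hat

def err_T(coef, n_test=100_000):
    # generalization error conditional on the fitted model, as in (7.15)
    x0 = rng.uniform(-1, 1, n_test)
    y0 = f(x0) + sigma * rng.standard_normal(n_test)
    return np.mean((y0 - np.polyval(coef, x0)) ** 2)

x, y = draw_training_set()
print("Err_T for this training set:", err_T(fit(x, y)))

# Err = E_T[Err_T]: average over many independent training sets, as in (7.16)
print("Err:", np.mean([err_T(fit(*draw_training_set())) for _ in range(200)]))
```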
As mentioned earlier, it turns out that most methods effectively estimate the expected error Err rather than the conditional error Err_T; see Section 7.12 for more on this point.

Now typically, the training error

    \mathrm{err} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))        (7.17)

will be less than the true error Err_T, because the same data is being used to fit the method and assess its error (see Exercise 2.9). A fitting method typically adapts to the training data, and hence the apparent or training error err will be an overly optimistic estimate of the generalization error Err_T.

Part of the discrepancy is due to where the evaluation points occur.
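A short sketch of this optimism under the same kind of arbitrary simulated setup: for each of many training sets it compares the training error (7.17) with a Monte Carlo estimate of the generalization error Err_T; the average gap is positive, showing that err is optimistic.

```python
import numpy as np

rng = np.random.default_rng(7)
N, sigma, degree = 30, 1.0, 5
f = lambda x: np.sin(3 * x)                  # arbitrary true function, as before

gaps = []
for _ in range(500):
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)
    coef = np.polyfit(x, y, degree)
    train_err = np.mean((y - np.polyval(coef, x)) ** 2)       # err, equation (7.17)
    x0 = rng.uniform(-1, 1, 20_000)
    y0 = f(x0) + sigma * rng.standard_normal(20_000)
    test_err = np.mean((y0 - np.polyval(coef, x0)) ** 2)      # Monte Carlo estimate of Err_T
    gaps.append(test_err - train_err)

print("average optimism of the training error:", np.mean(gaps))  # positive
```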
Thequantity ErrT can be thought of as extra-sample error, since the test inputvectors don’t need to coincide with the training input vectors. The natureof the optimism in err is easiest to understand when we focus instead onthe in-sample errorErrin =N1 XEY 0 [L(Yi0 , fˆ(xi ))|T ]N i=1(7.18)The Y 0 notation indicates that we observe N new response values ateach of the training points xi , i = 1, 2, . .