The Elements of Statistical Learning: Data Mining, Inference, and Prediction
7.4 Optimism of the Training Error Rate

We define the optimism as the difference between Err_in and the training error err:

\[
\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - \mathrm{err}. \qquad (7.19)
\]

This is typically positive since err is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets,

\[
\omega \equiv \mathrm{E}_{\mathbf{y}}(\mathrm{op}). \qquad (7.20)
\]

Here the predictors in the training set are fixed, and the expectation is over the training set outcome values; hence we have used the notation E_y instead of E_T. We can usually estimate only the expected error ω rather than op, in the same way that we can estimate the expected error Err rather than the conditional error Err_T.¹

For squared error, 0–1, and other loss functions, one can show quite generally that

\[
\omega = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i), \qquad (7.21)
\]

where Cov indicates covariance.

¹ Indeed, in the first edition of our book, this section wasn't sufficiently clear.
Thus the amount by which err underestimates the true error depends on how strongly y_i affects its own prediction. The harder we fit the data, the greater Cov(ŷ_i, y_i) will be, thereby increasing the optimism. Exercise 7.4 proves this result for squared error loss, where ŷ_i is the fitted value from the regression. For 0–1 loss, ŷ_i ∈ {0, 1} is the classification at x_i, and for entropy loss, ŷ_i ∈ [0, 1] is the fitted probability of class 1 at x_i.

In summary, we have the important relation

\[
\mathrm{E}_{\mathbf{y}}(\mathrm{Err}_{\mathrm{in}}) = \mathrm{E}_{\mathbf{y}}(\mathrm{err}) + \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i). \qquad (7.22)
\]

This expression simplifies if ŷ_i is obtained by a linear fit with d inputs or basis functions. For example,

\[
\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2 \qquad (7.23)
\]

for the additive error model Y = f(X) + ε, and so

\[
\mathrm{E}_{\mathbf{y}}(\mathrm{Err}_{\mathrm{in}}) = \mathrm{E}_{\mathbf{y}}(\mathrm{err}) + 2\cdot\frac{d}{N}\,\sigma_\varepsilon^2. \qquad (7.24)
\]

Expression (7.23) is the basis for the definition of the effective number of parameters discussed in Section 7.6. The optimism increases linearly with the number d of inputs or basis functions we use, but decreases as the training sample size increases. Versions of (7.24) hold approximately for other error models, such as binary data and entropy loss.
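As a quick empirical check of (7.21)–(7.23), the following sketch (not from the text; the values of N, d and σ_ε are arbitrary choices) holds the inputs fixed, draws many response vectors from the additive error model, refits ordinary least squares each time, and estimates the summed covariance directly. The result should be close to dσ_ε².

```python
import numpy as np

# Simulation check of (7.23): sum_i Cov(yhat_i, y_i) = d * sigma_eps^2 for OLS.
# N, d, sigma and the true coefficients are arbitrary, for illustration only.
rng = np.random.default_rng(0)
N, d, sigma = 100, 5, 2.0
X = rng.normal(size=(N, d))                  # fixed design (no intercept, for simplicity)
f = X @ rng.normal(size=d)                   # true regression function at the training inputs
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix: yhat = H y

n_rep = 20000
Y = f[:, None] + sigma * rng.normal(size=(N, n_rep))   # repeated training responses, X fixed
Yhat = H @ Y                                           # OLS fitted values for each replicate

# Empirical Cov(yhat_i, y_i), summed over i, averaged over replicates
cov_sum = np.mean(np.sum((Yhat - Yhat.mean(axis=1, keepdims=True)) *
                         (Y - Y.mean(axis=1, keepdims=True)), axis=0))
print(cov_sum, d * sigma**2)   # empirical sum vs. the theoretical value d*sigma^2 = 20.0
```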
An obvious way to estimate prediction error is to estimate the optimism and then add it to the training error err. The methods described in the next section (Cp, AIC, BIC and others) work in this way, for a special class of estimates that are linear in their parameters.

In contrast, cross-validation and bootstrap methods, described later in the chapter, are direct estimates of the extra-sample error Err. These general tools can be used with any loss function, and with nonlinear, adaptive fitting techniques.
In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values. But for comparison between models, in-sample error is convenient and often leads to effective model selection. The reason is that the relative (rather than absolute) size of the error is what matters.

7.5 Estimates of In-Sample Prediction Error

The general form of the in-sample estimates is

\[
\widehat{\mathrm{Err}}_{\mathrm{in}} = \mathrm{err} + \hat\omega, \qquad (7.25)
\]

where ω̂ is an estimate of the average optimism.

Using expression (7.24), applicable when d parameters are fit under squared error loss, leads to a version of the so-called Cp statistic,

\[
C_p = \mathrm{err} + 2\cdot\frac{d}{N}\,\hat\sigma_\varepsilon^2. \qquad (7.26)
\]

Here σ̂_ε² is an estimate of the noise variance, obtained from the mean squared error of a low-bias model. Using this criterion we adjust the training error by a factor proportional to the number of basis functions used.
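To make (7.26) concrete, here is an illustrative sketch (not from the text; the synthetic data and sizes are assumptions): σ̂_ε² is taken from a deliberately flexible "low-bias" fit using all p available predictors (here as the residual variance of that full fit), and Cp is then computed for a smaller candidate model with d of them.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, d = 200, 20, 5                       # illustrative sizes: p inputs, candidate uses d of them
X = rng.normal(size=(N, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.5, size=N)

def ols_fit(Xsub, y):
    """Fitted values from least squares on the columns of Xsub."""
    coef, *_ = np.linalg.lstsq(Xsub, y, rcond=None)
    return Xsub @ coef

# Noise variance estimated from a low-bias model (all p predictors),
# using the residual variance with N - p degrees of freedom.
resid_full = y - ols_fit(X, y)
sigma2_hat = np.sum(resid_full**2) / (N - p)

# Cp for a candidate model using the first d predictors, as in (7.26).
yhat_d = ols_fit(X[:, :d], y)
err = np.mean((y - yhat_d)**2)             # training error of the candidate model
Cp = err + 2 * (d / N) * sigma2_hat
print(f"err = {err:.3f}, Cp = {Cp:.3f}")
```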
The Akaike information criterion is a similar but more generally applicable estimate of Err_in when a log-likelihood loss function is used. It relies on a relationship similar to (7.24) that holds asymptotically as N → ∞:

\[
-2\cdot \mathrm{E}\bigl[\log \mathrm{Pr}_{\hat\theta}(Y)\bigr] \approx -\frac{2}{N}\cdot \mathrm{E}[\mathrm{loglik}] + 2\cdot\frac{d}{N}. \qquad (7.27)
\]

Here Pr_θ(Y) is a family of densities for Y (containing the "true" density), θ̂ is the maximum-likelihood estimate of θ, and "loglik" is the maximized log-likelihood:

\[
\mathrm{loglik} = \sum_{i=1}^{N}\log \mathrm{Pr}_{\hat\theta}(y_i). \qquad (7.28)
\]

For example, for the logistic regression model, using the binomial log-likelihood, we have

\[
\mathrm{AIC} = -\frac{2}{N}\cdot \mathrm{loglik} + 2\cdot\frac{d}{N}. \qquad (7.29)
\]

For the Gaussian model (with variance σ_ε² = σ̂_ε² assumed known), the AIC statistic is equivalent to Cp, and so we refer to them collectively as AIC.

To use AIC for model selection, we simply choose the model giving smallest AIC over the set of models considered.
For nonlinear and other complex models, we need to replace d by some measure of model complexity. We discuss this in Section 7.6.

Given a set of models f_α(x) indexed by a tuning parameter α, denote by err(α) and d(α) the training error and number of parameters for each model. Then for this set of models we define

\[
\mathrm{AIC}(\alpha) = \mathrm{err}(\alpha) + 2\cdot\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2. \qquad (7.30)
\]

The function AIC(α) provides an estimate of the test error curve, and we find the tuning parameter α̂ that minimizes it. Our final chosen model is f_α̂(x).
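The selection recipe is a short loop. The sketch below is illustrative, not the book's code: it fits logistic regressions on an increasing number M of polynomial basis functions of a simulated one-dimensional input, computes the per-observation AIC of (7.29) with d(M) = M, and keeps the minimizer. scikit-learn is used only as a convenient fitting routine, with a very large C so the fit is essentially unpenalized maximum likelihood.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
N = 500
x = rng.uniform(-2, 2, size=N)
p_true = 1 / (1 + np.exp(-(x - 0.5 * x**3)))      # a smooth "true" class-1 probability
y = rng.binomial(1, p_true)

def aic_for_M(M):
    """AIC (7.29) for a logistic fit on an M-term polynomial basis, so d(M) = M."""
    H = np.column_stack([x**m for m in range(1, M + 1)])   # basis h_1(x), ..., h_M(x)
    fit = LogisticRegression(C=1e6, fit_intercept=False, max_iter=10000).fit(H, y)
    phat = np.clip(fit.predict_proba(H)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(phat) + (1 - y) * np.log(1 - phat))
    return -2 / N * loglik + 2 * M / N

Ms = list(range(1, 11))
aics = [aic_for_M(M) for M in Ms]
M_hat = Ms[int(np.argmin(aics))]                  # alpha-hat: the AIC-minimizing model
print("AIC curve:", np.round(aics, 4))
print("chosen number of basis functions:", M_hat)
```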
Note that if the basis functions are chosen adaptively, (7.23) no longer holds. For example, if we have a total of p inputs, and we choose the best-fitting linear model with d < p inputs, the optimism will exceed (2d/N)σ_ε². Put another way, by choosing the best-fitting model with d inputs, the effective number of parameters fit is more than d.

Figure 7.4 shows AIC in action for the phoneme recognition example of Section 5.2.3 on page 148. The input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies. A linear logistic regression model is used to predict the phoneme class, with coefficient function β(f) = Σ_{m=1}^{M} h_m(f)θ_m, an expansion in M spline basis functions. For any given M, a basis of natural cubic splines is used for the h_m, with knots chosen uniformly over the range of frequencies (so d(α) = d(M) = M).
Using AIC to select the number of basis functions will approximately minimize Err(M) for both entropy and 0–1 loss.

The simple formula

\[
\frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i) = \frac{2d}{N}\,\sigma_\varepsilon^2
\]

holds exactly for linear models with additive errors and squared error loss, and approximately for linear models and log-likelihoods. In particular, the formula does not hold in general for 0–1 loss (Efron, 1986), although many authors nevertheless use it in that context (right panel of Figure 7.4).

[Figure 7.4: two panels, "Log-likelihood Loss" (left) and "0-1 Loss" (right), showing Train, Test and AIC curves plotted against the number of basis functions (2 to 128); the vertical axes are log-likelihood loss and misclassification error.]
FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Section 5.2.3. The logistic regression coefficient function β(f) = Σ_{m=1}^{M} h_m(f)θ_m is modeled as an expansion in M spline basis functions. In the left panel we see the AIC statistic used to estimate Err_in using log-likelihood loss. Included is an estimate of Err based on an independent test sample. It does well except for the extremely over-parametrized case (M = 256 parameters for N = 1000 observations).
In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.

7.6 The Effective Number of Parameters

The concept of "number of parameters" can be generalized, especially to models where regularization is used in the fitting.
Suppose we stack the outcomes y_1, y_2, . . . , y_N into a vector y, and similarly for the predictions ŷ. Then a linear fitting method is one for which we can write

\[
\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}, \qquad (7.31)
\]

where S is an N × N matrix depending on the input vectors x_i but not on the y_i.
Linear fitting methods include linear regression on the original features or on a derived basis set, and smoothing methods that use quadratic shrinkage, such as ridge regression and cubic smoothing splines. Then the effective number of parameters is defined as

\[
\mathrm{df}(\mathbf{S}) = \mathrm{trace}(\mathbf{S}), \qquad (7.32)
\]

the sum of the diagonal elements of S (also known as the effective degrees of freedom). Note that if S is an orthogonal-projection matrix onto a basis set spanned by M features, then trace(S) = M.
It turns out that trace(S) is exactly the correct quantity to replace d as the number of parameters in the Cp statistic (7.26). If y arises from an additive-error model Y = f(X) + ε with Var(ε) = σ_ε², then one can show that Σ_{i=1}^{N} Cov(ŷ_i, y_i) = trace(S)σ_ε², which motivates the more general definition

\[
\mathrm{df}(\hat{\mathbf{y}}) = \frac{\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i)}{\sigma_\varepsilon^2} \qquad (7.33)
\]

(Exercises 7.4 and 7.5). Section 5.4.1 on page 153 gives some more intuition for the definition df = trace(S) in the context of smoothing splines.
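For ridge regression, for instance, the smoother matrix is S_λ = X(XᵀX + λI)⁻¹Xᵀ, and trace(S_λ) equals Σ_j d_j²/(d_j² + λ) in terms of the singular values d_j of X (an identity derived in the ridge regression discussion of Chapter 3). A minimal sketch with made-up data, computing the effective degrees of freedom both ways:

```python
import numpy as np

# Effective degrees of freedom of ridge regression: df(lambda) = trace(S_lambda).
# The design matrix and penalty lam are arbitrary choices for illustration.
rng = np.random.default_rng(3)
N, p, lam = 50, 10, 5.0
X = rng.normal(size=(N, p))

S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # smoother matrix: yhat = S y
df_trace = np.trace(S)

d = np.linalg.svd(X, compute_uv=False)                    # singular values of X
df_svd = np.sum(d**2 / (d**2 + lam))

print(df_trace, df_svd)   # identical up to rounding; both lie between 0 and p
```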
For models like neural networks, in which we minimize an error function R(w) with weight decay penalty (regularization) α Σ_m w_m², the effective number of parameters has the form

\[
\mathrm{df}(\alpha) = \sum_{m=1}^{M}\frac{\theta_m}{\theta_m + \alpha}, \qquad (7.34)
\]

where the θ_m are the eigenvalues of the Hessian matrix ∂²R(w)/∂w∂wᵀ. Expression (7.34) follows from (7.32) if we make a quadratic approximation to the error function at the solution (Bishop, 1995).

7.7 The Bayesian Approach and BIC

The Bayesian information criterion (BIC), like AIC, is applicable in settings where the fitting is carried out by maximization of a log-likelihood.
The generic form of BIC is

\[
\mathrm{BIC} = -2\cdot \mathrm{loglik} + (\log N)\cdot d. \qquad (7.35)
\]

The BIC statistic (times 1/2) is also known as the Schwarz criterion (Schwarz, 1978).

Under the Gaussian model, assuming the variance σ_ε² is known, −2·loglik equals (up to a constant) Σ_i (y_i − f̂(x_i))²/σ_ε², which is N·err/σ_ε² for squared error loss.
Hence we can write

\[
\mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\Bigl[\mathrm{err} + (\log N)\cdot\frac{d}{N}\,\sigma_\varepsilon^2\Bigr]. \qquad (7.36)
\]

Therefore BIC is proportional to AIC (Cp), with the factor 2 replaced by log N. Assuming N > e² ≈ 7.4, BIC tends to penalize complex models more heavily, giving preference to simpler models in selection. As with AIC, σ_ε² is typically estimated by the mean squared error of a low-bias model. For classification problems, use of the multinomial log-likelihood leads to a similar relationship with the AIC, using cross-entropy as the error measure. Note however that the misclassification error measure does not arise in the BIC context, since it does not correspond to the log-likelihood of the data under any probability model.
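As an illustration of (7.35)–(7.36), the sketch below (not from the text; the synthetic data are assumptions) scores a nested sequence of Gaussian linear models with σ_ε² treated as known. On a common scale, AIC charges 2 per additional parameter while BIC charges log N, so for N > e² the BIC choice can never be larger than the AIC choice on the same fits.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, sigma2 = 100, 10, 1.0                 # sigma_eps^2 treated as known, as in (7.36)
X = rng.normal(size=(N, p))
y = X[:, :3] @ np.array([1.0, -2.0, 1.5]) + rng.normal(scale=np.sqrt(sigma2), size=N)

def train_err(d):
    """Training mean squared error of least squares on the first d predictors."""
    coef, *_ = np.linalg.lstsq(X[:, :d], y, rcond=None)
    return np.mean((y - X[:, :d] @ coef) ** 2)

errs = {d: train_err(d) for d in range(1, p + 1)}
# Multiplying the AIC form by N/sigma2 would put both on the same scale:
# AIC then charges 2 per parameter, BIC charges log(N) ~ 4.6 per parameter.
aic = {d: e + 2 * (d / N) * sigma2 for d, e in errs.items()}                        # (7.26)
bic = {d: (N / sigma2) * (e + np.log(N) * (d / N) * sigma2) for d, e in errs.items()}  # (7.36)
print("AIC chooses d =", min(aic, key=aic.get))
print("BIC chooses d =", min(bic, key=bic.get))
```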
Despite its similarity with AIC, BIC is motivated in quite a different way. It arises in the Bayesian approach to model selection, which we now describe.

Suppose we have a set of candidate models M_m, m = 1, . . . , M, and corresponding model parameters θ_m, and we wish to choose a best model from among them. Assuming we have a prior distribution Pr(θ_m | M_m) for the parameters of each model M_m, the posterior probability of a given model is

\[
\begin{aligned}
\Pr(\mathcal{M}_m \mid \mathbf{Z}) &\propto \Pr(\mathcal{M}_m)\cdot \Pr(\mathbf{Z} \mid \mathcal{M}_m) \\
&\propto \Pr(\mathcal{M}_m)\cdot \int \Pr(\mathbf{Z} \mid \theta_m, \mathcal{M}_m)\,\Pr(\theta_m \mid \mathcal{M}_m)\,d\theta_m,
\end{aligned}
\qquad (7.37)
\]

where Z represents the training data {x_i, y_i}_1^N. To compare two models M_m and M_ℓ, we form the posterior odds

\[
\frac{\Pr(\mathcal{M}_m \mid \mathbf{Z})}{\Pr(\mathcal{M}_\ell \mid \mathbf{Z})} = \frac{\Pr(\mathcal{M}_m)}{\Pr(\mathcal{M}_\ell)}\cdot\frac{\Pr(\mathbf{Z} \mid \mathcal{M}_m)}{\Pr(\mathbf{Z} \mid \mathcal{M}_\ell)}. \qquad (7.38)
\]

If the odds are greater than one we choose model m, otherwise we choose model ℓ. The rightmost quantity

\[
\mathrm{BF}(\mathbf{Z}) = \frac{\Pr(\mathbf{Z} \mid \mathcal{M}_m)}{\Pr(\mathbf{Z} \mid \mathcal{M}_\ell)} \qquad (7.39)
\]

is called the Bayes factor, the contribution of the data toward the posterior odds.

Typically we assume that the prior over models is uniform, so that Pr(M_m) is constant.
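To see how (7.37)–(7.39) are used once the marginal likelihoods Pr(Z | M_m) are in hand, here is a tiny sketch; the log marginal likelihood values are made up for illustration. With a uniform prior over models, the posterior probabilities are just the normalized marginal likelihoods, and the Bayes factor between two models is their ratio.

```python
import numpy as np

# Hypothetical log marginal likelihoods log Pr(Z | M_m) for three candidate models.
log_marglik = np.array([-140.2, -137.8, -139.1])

prior = np.ones(3) / 3                         # uniform prior Pr(M_m)
log_post = np.log(prior) + log_marglik         # unnormalized log posterior, as in (7.37)
post = np.exp(log_post - log_post.max())       # subtract the max for numerical stability
post /= post.sum()                             # normalized posterior model probabilities

# Bayes factor (7.39) and posterior odds (7.38) for model 2 versus model 1 (indices 1 and 0)
BF_21 = np.exp(log_marglik[1] - log_marglik[0])
odds_21 = (prior[1] / prior[0]) * BF_21
print("posterior probabilities:", np.round(post, 3))
print("BF(2 vs 1) =", round(BF_21, 2), " posterior odds =", round(odds_21, 2))
```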