Does this mean that cross-validation does not provide a good estimate of test error in this situation? [This question was suggested by Li Ma.]

8 Model Inference and Averaging

8.1 Introduction

For most of this book, the fitting (learning) of models has been achieved by minimizing a sum of squares for regression, or by minimizing cross-entropy for classification.
In fact, both of these minimizations are instances of the maximum likelihood approach to fitting.

In this chapter we provide a general exposition of the maximum likelihood approach, as well as the Bayesian method for inference. The bootstrap, introduced in Chapter 7, is discussed in this context, and its relation to maximum likelihood and Bayes is described. Finally, we present some related techniques for model averaging and improvement, including committee methods, bagging, stacking and bumping.

8.2 The Bootstrap and Maximum Likelihood Methods

8.2.1 A Smoothing Example

The bootstrap method provides a direct computational way of assessing uncertainty, by sampling from the training data.
Here we illustrate the bootstrap in a simple one-dimensional smoothing problem, and show its connection to maximum likelihood.

FIGURE 8.1. (Left panel:) Data for smoothing example. (Right panel:) Set of seven B-spline basis functions. The broken vertical lines indicate the placement of the three knots.

Denote the training data by Z = {z_1, z_2, ..., z_N}, with z_i = (x_i, y_i), i = 1, 2, ..., N. Here x_i is a one-dimensional input, and y_i the outcome, either continuous or categorical. As an example, consider the N = 50 data points shown in the left panel of Figure 8.1.

Suppose we decide to fit a cubic spline to the data, with three knots placed at the quartiles of the X values.
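For concreteness, here is a minimal sketch (in Python, using NumPy and SciPy) of one way to construct such a cubic spline basis. The helper name bspline_basis, the synthetic stand-in data, and the handling of the boundary knots are illustrative assumptions, not anything specified in the text.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, interior_knots, degree=3):
    """Evaluate a B-spline basis at the points x.

    With cubic splines (degree 3) and three interior knots this returns an
    N x 7 matrix, one column per basis function h_1(x), ..., h_7(x).
    """
    lo, hi = np.min(x), np.max(x)
    # Full knot sequence: boundary knots repeated degree + 1 times.
    t = np.r_[[lo] * (degree + 1), np.sort(interior_knots), [hi] * (degree + 1)]
    n_basis = len(t) - degree - 1
    H = np.empty((len(x), n_basis))
    for j in range(n_basis):
        coef = np.zeros(n_basis)
        coef[j] = 1.0                       # pick out the j-th basis function
        H[:, j] = BSpline(t, coef, degree)(x)
    return H

# Synthetic stand-in for the N = 50 training points of Figure 8.1.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 3.0, 50))
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=50)
knots = np.quantile(x, [0.25, 0.50, 0.75])  # three knots at the quartiles of x
H = bspline_basis(x, knots)                 # the N x 7 matrix H used below
```

The later sketches in this section assume a design matrix H and response vector y of this form.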
This is a seven-dimensional linear space of functions, and can be represented, for example, by a linear expansion of B-spline basis functions (see Section 5.9.2):

    \mu(x) = \sum_{j=1}^{7} \beta_j h_j(x).                                  (8.1)

Here the h_j(x), j = 1, 2, ..., 7 are the seven functions shown in the right panel of Figure 8.1. We can think of µ(x) as representing the conditional mean E(Y|X = x).

Let H be the N × 7 matrix with ijth element h_j(x_i). The usual estimate of β, obtained by minimizing the squared error over the training set, is given by

    \hat\beta = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{y}.      (8.2)

The corresponding fit \hat\mu(x) = \sum_{j=1}^{7} \hat\beta_j h_j(x) is shown in the top left panel of Figure 8.2.

The estimated covariance matrix of β̂ is

    \widehat{\mathrm{Var}}(\hat\beta) = (\mathbf{H}^T \mathbf{H})^{-1} \hat\sigma^2,    (8.3)

where we have estimated the noise variance by \hat\sigma^2 = \sum_{i=1}^{N} (y_i - \hat\mu(x_i))^2 / N.
Letting h(x)^T = (h_1(x), h_2(x), ..., h_7(x)), the standard error of a prediction µ̂(x) = h(x)^T β̂ is

    \widehat{\mathrm{se}}[\hat\mu(x)] = \left[h(x)^T (\mathbf{H}^T \mathbf{H})^{-1} h(x)\right]^{\frac{1}{2}} \hat\sigma.    (8.4)

In the top right panel of Figure 8.2 we have plotted µ̂(x) ± 1.96 · ŝe[µ̂(x)]. Since 1.96 is the 97.5% point of the standard normal distribution, these represent approximate 100 − 2 × 2.5% = 95% pointwise confidence bands for µ(x).

FIGURE 8.2. (Top left:) B-spline smooth of data. (Top right:) B-spline smooth plus and minus 1.96× standard error bands. (Bottom left:) Ten bootstrap replicates of the B-spline smooth. (Bottom right:) B-spline smooth with 95% standard error bands computed from the bootstrap distribution.
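As a rough sketch of how (8.2)-(8.4) translate into code, the functions below take the basis matrix H at the training points (for instance, as built in the earlier sketch) and a second basis matrix evaluated on a plotting grid. The function names and the 95% default level are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def fit_spline(H, y):
    """Least squares coefficients (8.2), noise variance and covariance (8.3)."""
    beta_hat = np.linalg.solve(H.T @ H, H.T @ y)      # (H'H)^{-1} H'y
    resid = y - H @ beta_hat
    sigma2_hat = np.mean(resid ** 2)                  # divide by N, as in the text
    cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat    # estimated Var(beta_hat)
    return beta_hat, sigma2_hat, cov_beta

def pointwise_bands(H_grid, beta_hat, cov_beta, level=0.95):
    """Fit and +/- z * standard-error bands (8.4) at the rows of H_grid."""
    mu_hat = H_grid @ beta_hat
    se = np.sqrt(np.einsum("ij,jk,ik->i", H_grid, cov_beta, H_grid))
    z = norm.ppf(0.5 + level / 2.0)                   # 1.96 for a 95% band
    return mu_hat, mu_hat - z * se, mu_hat + z * se
```

With H_grid built from the same knot sequence on a fine grid of x values, the three returned curves correspond in spirit to the top panels of Figure 8.2.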
Here is how we could apply the bootstrap in this example. We draw B datasets each of size N = 50 with replacement from our training data, the sampling unit being the pair z_i = (x_i, y_i). To each bootstrap dataset Z* we fit a cubic spline µ̂*(x); the fits from ten such samples are shown in the bottom left panel of Figure 8.2. Using B = 200 bootstrap samples, we can form a 95% pointwise confidence band from the percentiles at each x: we find the 2.5% × 200 = fifth largest and smallest values at each x. These are plotted in the bottom right panel of Figure 8.2. The bands look similar to those in the top right, being a little wider at the endpoints.

There is actually a close connection between the least squares estimates (8.2) and (8.3), the bootstrap, and maximum likelihood. Suppose we further assume that the model errors are Gaussian,

    Y = \mu(X) + \varepsilon; \qquad \varepsilon \sim N(0, \sigma^2),
    \mu(x) = \sum_{j=1}^{7} \beta_j h_j(x).                                  (8.5)

The bootstrap method described above, in which we sample with replacement from the training data, is called the nonparametric bootstrap. This really means that the method is "model-free," since it uses the raw data, not a specific parametric model, to generate new datasets.
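Here is a sketch of the nonparametric bootstrap band just described. For simplicity it keeps the knot sequence, and hence the basis matrices, fixed at the original quartiles when refitting each bootstrap sample; resampling rows of (H, y) together then corresponds to resampling the pairs z_i = (x_i, y_i). This is one reasonable implementation choice, not necessarily the one used for Figure 8.2.

```python
import numpy as np

def nonparametric_bootstrap_band(H, y, H_grid, B=200, alpha=0.025, seed=0):
    """Pointwise percentile band: resample pairs, refit, take quantiles at each x."""
    rng = np.random.default_rng(seed)
    N = len(y)
    curves = np.empty((B, H_grid.shape[0]))
    for b in range(B):
        idx = rng.integers(0, N, size=N)                      # sample with replacement
        beta_star = np.linalg.lstsq(H[idx], y[idx], rcond=None)[0]
        curves[b] = H_grid @ beta_star                        # bootstrap smooth at the grid
    lower = np.quantile(curves, alpha, axis=0)                # e.g. the 5th smallest of 200
    upper = np.quantile(curves, 1.0 - alpha, axis=0)
    return lower, upper
```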
Consider a variation of the bootstrap, called the parametric bootstrap, in which we simulate new responses by adding Gaussian noise to the predicted values:

    y_i^* = \hat\mu(x_i) + \varepsilon_i^*; \qquad \varepsilon_i^* \sim N(0, \hat\sigma^2); \qquad i = 1, 2, \ldots, N.    (8.6)

This process is repeated B times, where B = 200 say. The resulting bootstrap datasets have the form (x_1, y_1^*), ..., (x_N, y_N^*) and we recompute the B-spline smooth on each. The confidence bands from this method will exactly equal the least squares bands in the top right panel, as the number of bootstrap samples goes to infinity.
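A corresponding sketch of the parametric bootstrap (8.6), under the same simplification of a fixed basis; the noise standard deviation is estimated from the residuals of the original fit.

```python
import numpy as np

def parametric_bootstrap_band(H, y, H_grid, B=200, alpha=0.025, seed=0):
    """Simulate y* = mu_hat + Gaussian noise as in (8.6), refit, and take quantiles."""
    rng = np.random.default_rng(seed)
    beta_hat = np.linalg.solve(H.T @ H, H.T @ y)
    mu_hat = H @ beta_hat
    sigma_hat = np.sqrt(np.mean((y - mu_hat) ** 2))
    curves = np.empty((B, H_grid.shape[0]))
    for b in range(B):
        y_star = mu_hat + rng.normal(scale=sigma_hat, size=len(y))   # (8.6)
        beta_star = np.linalg.solve(H.T @ H, H.T @ y_star)
        curves[b] = H_grid @ beta_star
    return np.quantile(curves, alpha, axis=0), np.quantile(curves, 1.0 - alpha, axis=0)
```

As B grows, these percentile bands settle down to the least squares bands, consistent with the claim above and with (8.7) below.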
A function estimated from a bootstrap sample y* is given by \hat\mu^*(x) = h(x)^T (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{y}^*, and has distribution

    \hat\mu^*(x) \sim N\left(\hat\mu(x),\; h(x)^T (\mathbf{H}^T \mathbf{H})^{-1} h(x)\, \hat\sigma^2\right).    (8.7)

Notice that the mean of this distribution is the least squares estimate, and the standard deviation is the same as the approximate formula (8.4).

8.2.2 Maximum Likelihood Inference

It turns out that the parametric bootstrap agrees with least squares in the previous example because the model (8.5) has additive Gaussian errors. In general, the parametric bootstrap agrees not with least squares but with maximum likelihood, which we now review.

We begin by specifying a probability density or probability mass function for our observations

    z_i \sim g_\theta(z).                                                    (8.8)

In this expression θ represents one or more unknown parameters that govern the distribution of Z.
This is called a parametric model for Z. As an example, if Z has a normal distribution with mean µ and variance σ², then

    \theta = (\mu, \sigma^2),                                                (8.9)

and

    g_\theta(z) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}(z-\mu)^2/\sigma^2}.    (8.10)

Maximum likelihood is based on the likelihood function, given by

    L(\theta; \mathbf{Z}) = \prod_{i=1}^{N} g_\theta(z_i),                   (8.11)

the probability of the observed data under the model g_θ.
The likelihood is defined only up to a positive multiplier, which we have taken to be one. We think of L(θ; Z) as a function of θ, with our data Z fixed.

Denote the logarithm of L(θ; Z) by

    \ell(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \ell(\theta; z_i) = \sum_{i=1}^{N} \log g_\theta(z_i),    (8.12)

which we will sometimes abbreviate as ℓ(θ). This expression is called the log-likelihood, and each value ℓ(θ; z_i) = log g_θ(z_i) is called a log-likelihood component.
The method of maximum likelihood chooses the value θ = θ̂ to maximize ℓ(θ; Z).

The likelihood function can be used to assess the precision of θ̂. We need a few more definitions. The score function is defined by

    \dot\ell(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \dot\ell(\theta; z_i),     (8.13)

where \dot\ell(\theta; z_i) = \partial \ell(\theta; z_i)/\partial\theta. Assuming that the likelihood takes its maximum in the interior of the parameter space, \dot\ell(\hat\theta; \mathbf{Z}) = 0.
The information matrix is

    \mathbf{I}(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial\theta\, \partial\theta^T}.    (8.14)

When I(θ) is evaluated at θ = θ̂, it is often called the observed information. The Fisher information (or expected information) is

    \mathbf{i}(\theta) = \mathrm{E}_\theta[\mathbf{I}(\theta)].              (8.15)

Finally, let θ₀ denote the true value of θ.

A standard result says that the sampling distribution of the maximum likelihood estimator has a limiting normal distribution

    \hat\theta \to N(\theta_0, \mathbf{i}(\theta_0)^{-1}),                   (8.16)

as N → ∞.
Here we are independently sampling from g_{θ₀}(z). This suggests that the sampling distribution of θ̂ may be approximated by

    N(\hat\theta, \mathbf{i}(\hat\theta)^{-1}) \quad \text{or} \quad N(\hat\theta, \mathbf{I}(\hat\theta)^{-1}),    (8.17)

where θ̂ represents the maximum likelihood estimate from the observed data. The corresponding estimates for the standard errors of θ̂_j are obtained from

    \sqrt{\mathbf{i}(\hat\theta)^{-1}_{jj}} \quad \text{and} \quad \sqrt{\mathbf{I}(\hat\theta)^{-1}_{jj}}.          (8.18)
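To make these definitions concrete, the following sketch works through the Gaussian model (8.9)-(8.10) on synthetic data: the maximum likelihood estimates are available in closed form, the observed information I(θ̂) is approximated by finite differences of the log-likelihood (purely for illustration; it could equally be written down analytically), and the standard errors (8.18) come from the inverse information. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def log_lik(theta, z):
    """Gaussian log-likelihood (8.12), with theta = (mu, sigma2)."""
    mu, sigma2 = theta
    return np.sum(norm.logpdf(z, loc=mu, scale=np.sqrt(sigma2)))

def observed_information(theta, z, eps=1e-4):
    """Finite-difference approximation to I(theta) = minus the Hessian of log_lik."""
    p = len(theta)
    I = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            ej = np.eye(p)[j] * eps
            ek = np.eye(p)[k] * eps
            I[j, k] = -(log_lik(theta + ej + ek, z) - log_lik(theta + ej - ek, z)
                        - log_lik(theta - ej + ek, z) + log_lik(theta - ej - ek, z)) / (4 * eps ** 2)
    return I

rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=2.0, size=200)     # synthetic sample from g_theta0
theta_hat = np.array([z.mean(), z.var()])        # closed-form MLE of (mu, sigma2)
I_hat = observed_information(theta_hat, z)       # observed information (8.14)
se = np.sqrt(np.diag(np.linalg.inv(I_hat)))      # standard errors (8.18)
# For this model, se should be close to sqrt(sigma2_hat/N) and sigma2_hat*sqrt(2/N).
```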
Confidence points for θ_j can be constructed from either approximation in (8.17). Such a confidence point has the form

    \hat\theta_j - z^{(1-\alpha)} \cdot \sqrt{\mathbf{i}(\hat\theta)^{-1}_{jj}} \quad \text{or} \quad \hat\theta_j - z^{(1-\alpha)} \cdot \sqrt{\mathbf{I}(\hat\theta)^{-1}_{jj}},

respectively, where z^{(1−α)} is the 1 − α percentile of the standard normal distribution. More accurate confidence intervals can be derived from the likelihood function, by using the chi-squared approximation

    2[\ell(\hat\theta) - \ell(\theta_0)] \sim \chi^2_p,                      (8.19)

where p is the number of components in θ. The resulting 1 − 2α confidence interval is the set of all θ₀ such that 2[\ell(\hat\theta) - \ell(\theta_0)] \le {\chi^2_p}^{(1-2\alpha)}, where {\chi^2_p}^{(1-2\alpha)} is the 1 − 2α percentile of the chi-squared distribution with p degrees of freedom.
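As a one-dimensional illustration of (8.19), the sketch below computes a likelihood-based interval for the mean of a Gaussian sample, holding σ² fixed at its maximum likelihood value so that p = 1; this is a simplification made only to keep the example one-dimensional, with α = 0.025 so that 1 − 2α = 0.95. The data and grid range are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=2.0, size=200)   # synthetic data
sigma2_hat = z.var()                           # held fixed; only mu varies below

def log_lik(mu):
    return np.sum(norm.logpdf(z, loc=mu, scale=np.sqrt(sigma2_hat)))

mu_hat = z.mean()                              # maximizes the log-likelihood
alpha = 0.025
cutoff = chi2.ppf(1.0 - 2.0 * alpha, df=1)     # chi-squared percentile, p = 1

# Keep every mu with 2*[l(mu_hat) - l(mu)] <= cutoff; a grid scan suffices here.
grid = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 2001)
inside = [mu for mu in grid if 2.0 * (log_lik(mu_hat) - log_lik(mu)) <= cutoff]
interval = (min(inside), max(inside))
# In this Gaussian case the interval agrees with mu_hat -/+ 1.96*sqrt(sigma2_hat/200).
```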
Let's return to our smoothing example to see what maximum likelihood yields. The parameters are θ = (β, σ²). The log-likelihood is

    \ell(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - h(x_i)^T\beta\right)^2.    (8.20)

The maximum likelihood estimate is obtained by setting ∂ℓ/∂β = 0 and ∂ℓ/∂σ² = 0, giving

    \hat\beta = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat\mu(x_i)\right)^2,    (8.21)

which are the same as the usual estimates given in (8.2) and below (8.3).

The information matrix for θ = (β, σ²) is block-diagonal, and the block corresponding to β is

    \mathbf{I}(\beta) = (\mathbf{H}^T\mathbf{H})/\sigma^2,                   (8.22)

so that the estimated variance (H^T H)^{-1}σ̂² agrees with the least squares estimate (8.3).

8.2.3 Bootstrap versus Maximum Likelihood

In essence the bootstrap is a computer implementation of nonparametric or parametric maximum likelihood.