The advantage of the bootstrap over the maximum likelihood formula is that it allows us to compute maximum likelihood estimates of standard errors and other quantities in settings where no formulas are available.

In our example, suppose that we adaptively choose by cross-validation the number and position of the knots that define the B-splines, rather than fix them in advance. Denote by λ the collection of knots and their positions. Then the standard errors and confidence bands should account for the adaptive choice of λ, but there is no way to do this analytically. With the bootstrap, we compute the B-spline smooth with an adaptive choice of knots for each bootstrap sample.
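As an illustration of this adaptive-knot bootstrap, here is a minimal sketch, not the book's code: it assumes a cubic spline built from a truncated power basis as a stand-in for the B-spline basis, chooses the number of interior knots by K-fold cross-validation, and refits on each bootstrap sample. The helper names (`spline_basis`, `cv_error`, `adaptive_bootstrap_curves`) and the candidate knot counts are hypothetical choices for this sketch.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic-spline basis in truncated power form (a stand-in for B-splines)."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def cv_error(x, y, n_knots, n_folds=5, rng=None):
    """K-fold cross-validation error of a least-squares spline fit with n_knots interior knots."""
    rng = np.random.default_rng(rng)
    folds = rng.permutation(len(x)) % n_folds
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    err = 0.0
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        beta, *_ = np.linalg.lstsq(spline_basis(x[tr], knots), y[tr], rcond=None)
        err += np.sum((y[te] - spline_basis(x[te], knots) @ beta) ** 2)
    return err / len(x)

def adaptive_bootstrap_curves(x, y, xgrid, candidates=(2, 3, 5, 7, 9), B=200, seed=0):
    """For each bootstrap sample, pick the number of knots by CV, then refit the smooth."""
    rng = np.random.default_rng(seed)
    curves = np.empty((B, len(xgrid)))
    for b in range(B):
        idx = rng.integers(0, len(x), len(x))          # sample with replacement
        xb, yb = x[idx], y[idx]
        best = min(candidates, key=lambda m: cv_error(xb, yb, m, rng=rng))
        knots = np.quantile(xb, np.linspace(0, 1, best + 2)[1:-1])
        beta, *_ = np.linalg.lstsq(spline_basis(xb, knots), yb, rcond=None)
        curves[b] = spline_basis(xgrid, knots) @ beta
    return curves   # pointwise percentiles of these curves give the confidence bands
```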
The percentiles of the resulting curves capture the variability from both the noise in the targets as well as that from λ̂. In this particular example the confidence bands (not shown) don't look much different than the fixed λ bands. But in other problems, where more adaptation is used, this can be an important effect to capture.

8.3 Bayesian Methods

In the Bayesian approach to inference, we specify a sampling model Pr(Z|θ) (density or probability mass function) for our data given the parameters, and a prior distribution for the parameters Pr(θ) reflecting our knowledge about θ before we see the data.
We then compute the posterior distribution

    Pr(θ|Z) = Pr(Z|θ) · Pr(θ) / ∫ Pr(Z|θ) · Pr(θ) dθ,        (8.23)

which represents our updated knowledge about θ after we see the data. To understand this posterior distribution, one might draw samples from it or summarize by computing its mean or mode. The Bayesian approach differs from the standard ("frequentist") method for inference in its use of a prior distribution to express the uncertainty present before seeing the data, and to allow the uncertainty remaining after seeing the data to be expressed in the form of a posterior distribution.

The posterior distribution also provides the basis for predicting the values of a future observation z^new, via the predictive distribution:

    Pr(z^new|Z) = ∫ Pr(z^new|θ) · Pr(θ|Z) dθ.        (8.24)

In contrast, the maximum likelihood approach would use Pr(z^new|θ̂), the data density evaluated at the maximum likelihood estimate, to predict future data.
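As a small numerical illustration of (8.23) and (8.24), not taken from the book, the posterior and predictive distribution for a toy Gaussian model with unknown mean can be approximated on a grid; the data values, the N(0, 9) prior, and the grid limits below are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

# Toy data: a few observations from a Gaussian model with unknown mean theta, known sd 1.
z = np.array([0.3, 1.1, 0.7, 1.6])

# Grid approximation to the posterior (8.23): prior times likelihood, then normalize.
theta = np.linspace(-4.0, 6.0, 2001)
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=3.0)                          # Pr(theta); N(0, 9) is arbitrary
lik = np.prod(norm.pdf(z[:, None], loc=theta, scale=1.0), axis=0)    # Pr(Z | theta)
post = prior * lik
post /= post.sum() * dtheta                                          # divide by the integral in (8.23)

# Predictive distribution (8.24): average Pr(z_new | theta) over the posterior.
z_new = np.linspace(-4.0, 6.0, 401)
dz = z_new[1] - z_new[0]
pred = (norm.pdf(z_new[:, None], loc=theta, scale=1.0) * post).sum(axis=1) * dtheta

# Maximum likelihood plug-in alternative: Pr(z_new | theta_hat).
theta_hat = z.mean()
plug_in = norm.pdf(z_new, loc=theta_hat, scale=1.0)

def spread(p):
    m = (z_new * p).sum() * dz
    return np.sqrt(((z_new - m) ** 2 * p).sum() * dz)

# The predictive distribution is wider: it carries the uncertainty about theta.
print(spread(pred), spread(plug_in))
```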
Unlike the predictive distribution (8.24), this plug-in prediction does not account for the uncertainty in estimating θ.

Let's walk through the Bayesian approach in our smoothing example. We start with the parametric model given by equation (8.5), and assume for the moment that σ² is known. We assume that the observed feature values x1, x2, . . . , xN are fixed, so that the randomness in the data comes solely from y varying around its mean µ(x).

The second ingredient we need is a prior distribution. Distributions on functions are fairly complex entities: one approach is to use a Gaussian process prior in which we specify the prior covariance between any two function values µ(x) and µ(x′) (Wahba, 1990; Neal, 1996). Here we take a simpler route: by considering a finite B-spline basis for µ(x), we can instead provide a prior for the coefficients β, and this implicitly defines a prior for µ(x).
We choose a Gaussian prior centered at zero,

    β ∼ N(0, τΣ),        (8.25)

with the choices of the prior correlation matrix Σ and variance τ to be discussed below. The implicit process prior for µ(x) is hence Gaussian, with covariance kernel

    K(x, x′) = cov[µ(x), µ(x′)] = τ · h(x)ᵀ Σ h(x′).        (8.26)

FIGURE 8.3. Smoothing example: Ten draws from the Gaussian prior distribution for the function µ(x).

The posterior distribution for β is also Gaussian, with mean and covariance

    E(β|Z) = (Hᵀ H + (σ²/τ) Σ⁻¹)⁻¹ Hᵀ y,
    cov(β|Z) = (Hᵀ H + (σ²/τ) Σ⁻¹)⁻¹ σ²,        (8.27)

with the corresponding posterior values for µ(x),

    E(µ(x)|Z) = h(x)ᵀ (Hᵀ H + (σ²/τ) Σ⁻¹)⁻¹ Hᵀ y,
    cov[µ(x), µ(x′)|Z] = h(x)ᵀ (Hᵀ H + (σ²/τ) Σ⁻¹)⁻¹ h(x′) σ².        (8.28)

How do we choose the prior correlation matrix Σ? In some settings the prior can be chosen from subject matter knowledge about the parameters. Here we are willing to say the function µ(x) should be smooth, and have guaranteed this by expressing µ in a smooth low-dimensional basis of B-splines.
Hence we can take the prior correlation matrix to be the identity, Σ = I. When the number of basis functions is large, this might not be sufficient, and additional smoothness can be enforced by imposing restrictions on Σ; this is exactly the case with smoothing splines (Section 5.8.1).

Figure 8.3 shows ten draws from the corresponding prior for µ(x). To generate posterior values of the function µ(x), we generate values β′ from its posterior (8.27), giving the corresponding posterior value µ′(x) = Σ_{j=1}^{7} β′_j h_j(x). Ten such posterior curves are shown in Figure 8.4. Two different values were used for the prior variance τ: 1 and 1000.
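A minimal sketch of these posterior computations, not the book's code: it assumes simulated data, a seven-column cubic-spline basis H (a truncated power basis standing in for the B-splines), a known noise variance σ², τ = 1, and Σ = I, and then applies (8.27) and (8.28) to produce the posterior mean and ten posterior curves of the kind shown in Figure 8.4.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup standing in for the book's example: N points on [0, 3],
# a 7-column spline-type basis H, and noisy responses y.
N, p = 50, 7
x = np.sort(rng.uniform(0.0, 3.0, N))
knots = np.linspace(0.0, 3.0, p - 2)[1:-1]                       # three interior knots (illustrative)
H = np.column_stack([np.ones_like(x), x, x**2, x**3]
                    + [np.clip(x - k, 0, None) ** 3 for k in knots])
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

sigma2, tau = 0.3**2, 1.0                                        # sigma^2 treated as known, as in the text
Sigma_inv = np.eye(p)                                            # Sigma = I

# Posterior mean and covariance of beta, equation (8.27).
A = np.linalg.inv(H.T @ H + (sigma2 / tau) * Sigma_inv)
post_mean = A @ H.T @ y
post_cov = A * sigma2

# Ten posterior draws of beta, hence ten posterior curves mu'(x) = sum_j beta'_j h_j(x).
beta_draws = rng.multivariate_normal(post_mean, post_cov, size=10)
xg = np.linspace(0.0, 3.0, 200)
Hg = np.column_stack([np.ones_like(xg), xg, xg**2, xg**3]
                     + [np.clip(xg - k, 0, None) ** 3 for k in knots])
curves = beta_draws @ Hg.T                                       # each row is one posterior curve
post_mean_curve = Hg @ post_mean                                 # posterior mean curve, equation (8.28)
```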
FIGURE 8.4. Smoothing example: Ten draws from the posterior distribution for the function µ(x), for two different values of the prior variance τ (left panel: τ = 1; right panel: τ = 1000). The purple curves are the posterior means.

Notice how similar the right panel looks to the bootstrap distribution in the bottom left panel of Figure 8.2 on page 263. This similarity is no accident. As τ → ∞, the posterior distribution (8.27) and the bootstrap distribution (8.7) coincide. On the other hand, for τ = 1, the posterior curves µ(x) in the left panel of Figure 8.4 are smoother than the bootstrap curves, because we have imposed more prior weight on smoothness.

The distribution (8.25) with τ → ∞ is called a noninformative prior for θ.
In Gaussian models, maximum likelihood and parametric bootstrap analyses tend to agree with Bayesian analyses that use a noninformative prior for the free parameters. These tend to agree because, with a constant prior, the posterior distribution is proportional to the likelihood. This correspondence also extends to the nonparametric case, where the nonparametric bootstrap approximates a noninformative Bayes analysis; Section 8.4 has the details.

We have, however, done some things that are not proper from a Bayesian point of view.
We have used a noninformative (constant) prior for σ² and replaced it with the maximum likelihood estimate σ̂² in the posterior. A more standard Bayesian analysis would also put a prior on σ (typically g(σ) ∝ 1/σ), calculate a joint posterior for µ(x) and σ, and then integrate out σ, rather than just extract the maximum of the posterior distribution ("MAP" estimate).

8.4 Relationship Between the Bootstrap and Bayesian Inference

Consider first a very simple example, in which we observe a single observation z from a normal distribution

    z ∼ N(θ, 1).        (8.29)

To carry out a Bayesian analysis for θ, we need to specify a prior. The most convenient and common choice would be θ ∼ N(0, τ), giving the posterior distribution

    θ|z ∼ N( z/(1 + 1/τ), 1/(1 + 1/τ) ).        (8.30)

Now the larger we take τ, the more concentrated the posterior becomes around the maximum likelihood estimate θ̂ = z.
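To see this numerically, a small sketch (the single observed value z below is arbitrary) evaluates the posterior mean and variance in (8.30) for increasing τ:

```python
import numpy as np

z = 1.3                                   # a single (arbitrary) observed value
for tau in (1.0, 10.0, 1000.0):
    post_mean = z / (1.0 + 1.0 / tau)     # posterior mean in (8.30)
    post_var = 1.0 / (1.0 + 1.0 / tau)    # posterior variance in (8.30)
    print(f"tau = {tau:6.0f}   posterior mean = {post_mean:.3f}   variance = {post_var:.3f}")

# As tau grows, the posterior approaches N(z, 1), i.e. the parametric bootstrap
# distribution of z* drawn from the fitted sampling density N(theta_hat, 1) with theta_hat = z.
```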
In the limit as τ → ∞ we obtain a noninformative (constant) prior, and the posterior distribution is

    θ|z ∼ N(z, 1).        (8.31)

This is the same as a parametric bootstrap distribution in which we generate bootstrap values z* from the maximum likelihood estimate of the sampling density N(z, 1).

There are three ingredients that make this correspondence work:

1. The choice of noninformative prior for θ.

2. The dependence of the log-likelihood ℓ(θ; Z) on the data Z only through the maximum likelihood estimate θ̂.
   Hence we can write the log-likelihood as ℓ(θ; θ̂).

3. The symmetry of the log-likelihood in θ and θ̂, that is, ℓ(θ; θ̂) = ℓ(θ̂; θ) + constant.

Properties (2) and (3) essentially only hold for the Gaussian distribution. However, they also hold approximately for the multinomial distribution, leading to a correspondence between the nonparametric bootstrap and Bayes inference, which we outline next.

Assume that we have a discrete sample space with L categories. Let wj be the probability that a sample point falls in category j, and ŵj the observed proportion in category j. Let w = (w1, w2, . . . , wL), ŵ = (ŵ1, ŵ2, . . . , ŵL). Denote our estimator by S(ŵ); take as a prior distribution for w a symmetric Dirichlet distribution with parameter a:

    w ∼ Di_L(a1),        (8.32)
that is, the prior probability mass function is proportional to ∏_{ℓ=1}^{L} w_ℓ^{a−1}. Then the posterior density of w is

    w ∼ Di_L(a1 + N ŵ),        (8.33)

where N is the sample size. Letting a → 0 to obtain a noninformative prior gives

    w ∼ Di_L(N ŵ).        (8.34)

Now the bootstrap distribution, obtained by sampling with replacement from the data, can be expressed as sampling the category proportions from a multinomial distribution.
Specifically,

    N ŵ* ∼ Mult(N, ŵ),        (8.35)

where Mult(N, ŵ) denotes a multinomial distribution, having probability mass function equal to the multinomial coefficient N!/((N ŵ*_1)! · · · (N ŵ*_L)!) times ∏_{ℓ=1}^{L} ŵ_ℓ^{N ŵ*_ℓ}. This distribution is similar to the posterior distribution above, having the same support, same mean, and nearly the same covariance matrix. Hence the bootstrap distribution of S(ŵ*) will closely approximate the posterior distribution of S(w).

In this sense, the bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter. But this bootstrap distribution is obtained painlessly—without having to formally specify a prior and without having to sample from the posterior distribution.
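A short simulation, not from the book, makes this correspondence concrete: for an arbitrary functional S(w), here simply the probability of the first category, compare bootstrap draws of S(ŵ*) under (8.35) with draws of S(w) from the noninformative posterior (8.34). The toy counts below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

counts = np.array([38, 31, 19, 12])      # toy observed counts over L = 4 categories
N = counts.sum()
w_hat = counts / N

def S(w):
    # An arbitrary functional of the category probabilities, chosen only for illustration.
    return w[0]

B = 10000
# Bootstrap: N * w_hat_star ~ Mult(N, w_hat), equation (8.35).
boot = rng.multinomial(N, w_hat, size=B) / N
S_boot = np.array([S(w) for w in boot])

# Noninformative Bayes posterior: w ~ Di_L(N * w_hat), equation (8.34).
post = rng.dirichlet(N * w_hat, size=B)
S_post = np.array([S(w) for w in post])

print("bootstrap  mean %.4f  sd %.4f" % (S_boot.mean(), S_boot.std()))
print("posterior  mean %.4f  sd %.4f" % (S_post.mean(), S_post.std()))
# The two distributions have the same mean and nearly the same spread.
```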
Hence we might think of the bootstrap distribution as a “poor man's” Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.

8.5 The EM Algorithm

The EM algorithm is a popular tool for simplifying difficult maximum likelihood problems. We first describe it in the context of a simple mixture model.

8.5.1 Two-Component Mixture Model

In this section we describe a simple mixture model for density estimation, and the associated EM algorithm for carrying out maximum likelihood estimation.
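As a preview of the iterations to come, here is a minimal sketch of EM for a two-component Gaussian mixture, assuming Gaussian components with means µ1, µ2, variances σ1², σ2² and mixing proportion π; it is an illustrative implementation with arbitrary starting values and stopping rule, not the book's algorithm listing.

```python
import numpy as np
from scipy.stats import norm

def em_two_component(y, n_iter=100, tol=1e-8, seed=0):
    """EM for a two-component Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Crude starting values: two random data points as means, overall variance, pi = 0.5.
    mu1, mu2 = rng.choice(y, 2, replace=False)
    s1 = s2 = y.var()
    pi = 0.5
    loglik_old = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of component 2 for each observation.
        p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
        p2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
        gamma = p2 / (p1 + p2)
        # M-step: weighted means, variances, and mixing proportion.
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = gamma.mean()
        # Observed-data log-likelihood at the parameters used in this E-step;
        # it is non-decreasing over iterations.
        loglik = np.sum(np.log(p1 + p2))
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return mu1, mu2, s1, s2, pi, loglik

# Example usage on simulated data from a two-component mixture.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(1.0, 0.5, 60), rng.normal(4.5, 1.0, 40)])
print(em_two_component(y))
```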