The Elements of Statistical Learning: Data Mining, Inference, and Prediction
This has a natural connection to Gibbs sampling methods for Bayesian inference. Mixture models are discussed and demonstrated in several other parts of the book, in particular Sections 6.8, 12.7 and 13.2.3.

The left panel of Figure 8.5 shows a histogram of the 20 fictitious data points in Table 8.1.

FIGURE 8.5. Mixture example. (Left panel:) Histogram of data. (Right panel:) Maximum likelihood fit of Gaussian densities (solid red) and responsibility (dotted green) of the left component density for observation y, as a function of y.

TABLE 8.1. Twenty fictitious data points used in the two-component mixture example in Figure 8.5.

-0.39  0.06  0.12  0.48  0.94  1.01  1.67  1.68  1.76  1.80
 2.44  3.25  3.72  4.12  4.28  4.60  4.92  5.28  5.53  6.22

We would like to model the density of the data points, and due to the apparent bi-modality, a Gaussian distribution would not be appropriate. There seem to be two separate underlying regimes, so instead we model Y as a mixture of two normal distributions:

\[
\begin{aligned}
Y_1 &\sim N(\mu_1, \sigma_1^2), \\
Y_2 &\sim N(\mu_2, \sigma_2^2), \\
Y &= (1 - \Delta)\cdot Y_1 + \Delta\cdot Y_2,
\end{aligned}
\tag{8.36}
\]

where ∆ ∈ {0, 1} with Pr(∆ = 1) = π.
This generative representation is explicit: generate a ∆ ∈ {0, 1} with probability π, and then depending on the outcome, deliver either Y1 or Y2. Let φθ(x) denote the normal density with parameters θ = (µ, σ²). Then the density of Y is

\[
g_Y(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y).
\tag{8.37}
\]

Now suppose we wish to fit this model to the data in Figure 8.5 by maximum likelihood. The parameters are

\[
\theta = (\pi, \theta_1, \theta_2) = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2).
\tag{8.38}
\]

The log-likelihood based on the N training cases is

\[
\ell(\theta; Z) = \sum_{i=1}^{N} \log\bigl[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\bigr].
\tag{8.39}
\]

Direct maximization of ℓ(θ; Z) is quite difficult numerically, because of the sum of terms inside the logarithm. There is, however, a simpler approach.
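Before turning to that simpler approach, it may help to see the model in code. The following is a minimal sketch (ours, not code from the book) that evaluates the mixture density (8.37) and the observed-data log-likelihood (8.39) on the Table 8.1 data, at an arbitrary illustrative parameter setting; all function and variable names are our own.

```python
import numpy as np

# The 20 data points from Table 8.1.
y = np.array([-0.39, 0.06, 0.12, 0.48, 0.94, 1.01, 1.67, 1.68, 1.76, 1.80,
              2.44, 3.25, 3.72, 4.12, 4.28, 4.60, 4.92, 5.28, 5.53, 6.22])

def phi(y, mu, sigma2):
    """Normal density phi_theta(y) with theta = (mu, sigma^2)."""
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def loglik(pi, mu1, s1, mu2, s2, y):
    """Observed-data log-likelihood (8.39) for the two-component mixture."""
    g = (1.0 - pi) * phi(y, mu1, s1) + pi * phi(y, mu2, s2)   # mixture density (8.37)
    return np.log(g).sum()

# Evaluate at a purely illustrative parameter setting.
print(loglik(0.5, 4.5, 1.0, 1.0, 1.0, y))
```

Because each term of (8.39) is the logarithm of a sum over the two components, the parameters do not separate, which is what makes direct maximization awkward.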
We consider unobserved latent variables ∆i taking values 0 or 1 as in (8.36): if ∆i = 1 then Yi comes from model 2, otherwise it comes from model 1. Suppose we knew the values of the ∆i's. Then the log-likelihood would be

\[
\ell_0(\theta; Z, \Delta) = \sum_{i=1}^{N} \bigl[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\bigr]
+ \sum_{i=1}^{N} \bigl[(1 - \Delta_i)\log(1 - \pi) + \Delta_i\log\pi\bigr],
\tag{8.40}
\]

and the maximum likelihood estimates of µ1 and σ1² would be the sample mean and variance for those data with ∆i = 0, and similarly those for µ2 and σ2² would be the sample mean and variance of the data with ∆i = 1. The estimate of π would be the proportion of ∆i = 1.

Since the values of the ∆i's are actually unknown, we proceed in an iterative fashion, substituting for each ∆i in (8.40) its expected value

\[
\gamma_i(\theta) = \mathrm{E}(\Delta_i \mid \theta, Z) = \Pr(\Delta_i = 1 \mid \theta, Z),
\tag{8.41}
\]

also called the responsibility of model 2 for observation i.
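To make the "suppose we knew the ∆i" argument concrete, here is a minimal sketch of the closed-form complete-data estimates described above. The 0/1 labels used are a hypothetical assignment we made up for illustration; the ∆i are of course unobserved in practice.

```python
import numpy as np

y = np.array([-0.39, 0.06, 0.12, 0.48, 0.94, 1.01, 1.67, 1.68, 1.76, 1.80,
              2.44, 3.25, 3.72, 4.12, 4.28, 4.60, 4.92, 5.28, 5.53, 6.22])

# Hypothetical latent labels: 1 means "from model 2", 0 means "from model 1".
delta = np.array([1] * 10 + [0] * 10)

# Maximizing the complete-data log-likelihood (8.40) gives group-wise sample statistics.
mu1_hat, s1_hat = y[delta == 0].mean(), y[delta == 0].var()   # mean, ML variance for Delta_i = 0
mu2_hat, s2_hat = y[delta == 1].mean(), y[delta == 1].var()   # mean, ML variance for Delta_i = 1
pi_hat = delta.mean()                                         # proportion of Delta_i = 1

print(mu1_hat, s1_hat, mu2_hat, s2_hat, pi_hat)
```

In the iterative procedure described next, these hard 0/1 labels are replaced by the responsibilities (8.41).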
We use a procedure called the EM algorithm, given in Algorithm 8.1 for the special case of Gaussian mixtures. In the expectation step, we do a soft assignment of each observation to each model: the current estimates of the parameters are used to assign responsibilities according to the relative density of the training points under each model. In the maximization step, these responsibilities are used in weighted maximum-likelihood fits to update the estimates of the parameters.

A good way to construct initial guesses for µ̂1 and µ̂2 is simply to choose two of the yi at random. Both σ̂1² and σ̂2² can be set equal to the overall sample variance Σi (yi − ȳ)²/N.
The mixing proportion π̂ can be started at the value 0.5.

Note that the actual maximizer of the likelihood occurs when we put a spike of infinite height at any one data point, that is, µ̂1 = yi for some i and σ̂1² = 0. This gives infinite likelihood, but is not a useful solution. Hence we are actually looking for a good local maximum of the likelihood, one for which σ̂1², σ̂2² > 0.
To further complicate matters, there can be more than one local maximum having σ̂1², σ̂2² > 0. In our example, we ran the EM algorithm with a number of different initial guesses for the parameters, all having σ̂k² > 0.5, and chose the run that gave us the highest maximized likelihood. Figure 8.6 shows the progress of the EM algorithm in maximizing the log-likelihood. Table 8.2 shows π̂ = Σi γ̂i/N, the maximum likelihood estimate of the proportion of observations in class 2, at selected iterations of the EM procedure.

Algorithm 8.1 EM Algorithm for Two-component Gaussian Mixture.

1. Take initial guesses for the parameters µ̂1, σ̂1², µ̂2, σ̂2², π̂ (see text).
2. Expectation Step: compute the responsibilities

\[
\hat{\gamma}_i = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{(1 - \hat{\pi})\,\phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}, \qquad i = 1, 2, \ldots, N.
\tag{8.42}
\]

3. Maximization Step: compute the weighted means and variances:

\[
\hat{\mu}_1 = \frac{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)\,y_i}{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)}, \qquad
\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N}(1 - \hat{\gamma}_i)},
\]
\[
\hat{\mu}_2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i\,y_i}{\sum_{i=1}^{N}\hat{\gamma}_i}, \qquad
\hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i\,(y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N}\hat{\gamma}_i},
\]

and the mixing probability π̂ = Σi γ̂i/N.

4. Iterate steps 2 and 3 until convergence.
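One possible implementation sketch of Algorithm 8.1 (our code, not the authors') is given below. It uses the initialization suggested in the text: two randomly chosen yi for the means, the overall sample variance for both variances, and π̂ = 0.5. For simplicity it runs a fixed number of iterations in place of a formal convergence test.

```python
import numpy as np

def phi(y, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def em_two_component(y, n_iter=20, seed=0):
    """EM for the two-component Gaussian mixture (Algorithm 8.1)."""
    rng = np.random.default_rng(seed)
    # Step 1: initial guesses (see text).
    mu1, mu2 = rng.choice(y, size=2, replace=False)
    s1 = s2 = y.var()
    pi = 0.5
    for _ in range(n_iter):
        # Step 2, Expectation: responsibilities (8.42) of model 2 for each observation.
        num = pi * phi(y, mu2, s2)
        gamma = num / ((1.0 - pi) * phi(y, mu1, s1) + num)
        # Step 3, Maximization: weighted means, variances and mixing probability.
        w1, w2 = (1.0 - gamma).sum(), gamma.sum()
        mu1 = ((1.0 - gamma) * y).sum() / w1
        s1 = ((1.0 - gamma) * (y - mu1) ** 2).sum() / w1
        mu2 = (gamma * y).sum() / w2
        s2 = (gamma * (y - mu2) ** 2).sum() / w2
        pi = gamma.mean()
    return mu1, s1, mu2, s2, pi

y = np.array([-0.39, 0.06, 0.12, 0.48, 0.94, 1.01, 1.67, 1.68, 1.76, 1.80,
              2.44, 3.25, 3.72, 4.12, 4.28, 4.60, 4.92, 5.28, 5.53, 6.22])
print(em_two_component(y))
```

In practice one would track the observed-data log-likelihood (8.39) across iterations, stop when it no longer increases appreciably, and, as described above, repeat from several random starts, keeping the solution with the highest maximized likelihood.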
TABLE 8.2. Selected iterations of the EM algorithm for mixture example.

Iteration    1      5      10     15     20
π̂            0.485  0.493  0.523  0.544  0.546

The final maximum likelihood estimates are

µ̂1 = 4.62,  σ̂1² = 0.87,
µ̂2 = 1.06,  σ̂2² = 0.77,
π̂ = 0.546.

The right panel of Figure 8.5 shows the estimated Gaussian mixture density from this procedure (solid red curve), along with the responsibilities (dotted green curve). Note that mixtures are also useful for supervised learning; in Section 6.7 we show how the Gaussian mixture model leads to a version of radial basis functions.
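The two curves in the right panel of Figure 8.5 can be recomputed (apart from the plotting itself) from the estimates just quoted. The sketch below, again ours rather than the book's, evaluates the fitted mixture density (8.37) and the responsibility of the left component, as in (8.42), on a grid of y values covering the data.

```python
import numpy as np

def phi(y, mu, sigma2):
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

# Final maximum likelihood estimates quoted in the text.
mu1, s1 = 4.62, 0.87
mu2, s2 = 1.06, 0.77
pi = 0.546

grid = np.linspace(-1.0, 7.0, 200)
density = (1 - pi) * phi(grid, mu1, s1) + pi * phi(grid, mu2, s2)  # solid red curve in Figure 8.5
gamma = pi * phi(grid, mu2, s2) / density                          # dotted green curve: responsibility of the left component
```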
FIGURE 8.6. EM algorithm: observed data log-likelihood as a function of the iteration number.

8.5.2 The EM Algorithm in General

The above procedure is an example of the EM (or Baum–Welch) algorithm for maximizing likelihoods in certain classes of problems. These problems are ones for which maximization of the likelihood is difficult, but made easier by enlarging the sample with latent (unobserved) data. This is called data augmentation. Here the latent data are the model memberships ∆i. In other problems, the latent data are actual data that should have been observed but are missing.

Algorithm 8.2 gives the general formulation of the EM algorithm.
Our observed data is Z, having log-likelihood ℓ(θ; Z) depending on parameters θ. The latent or missing data is Zm, so that the complete data is T = (Z, Zm) with log-likelihood ℓ0(θ; T), ℓ0 based on the complete density. In the mixture problem (Z, Zm) = (y, ∆), and ℓ0(θ; T) is given in (8.40).

Algorithm 8.2 The EM Algorithm.

1. Start with initial guesses for the parameters θ̂(0).

2. Expectation Step: at the jth step, compute

\[
Q(\theta', \hat{\theta}^{(j)}) = \mathrm{E}\bigl(\ell_0(\theta'; T) \mid Z, \hat{\theta}^{(j)}\bigr)
\tag{8.43}
\]

as a function of the dummy argument θ′.

3. Maximization Step: determine the new estimate θ̂(j+1) as the maximizer of Q(θ′, θ̂(j)) over θ′.

4. Iterate steps 2 and 3 until convergence.

In our mixture example, E(ℓ0(θ′; T)|Z, θ̂(j)) is simply (8.40) with the ∆i replaced by the responsibilities γ̂i(θ̂), and the maximizers in step 3 are just weighted means and variances.

We now give an explanation of why the EM algorithm works in general. Since

\[
\Pr(Z^m \mid Z, \theta') = \frac{\Pr(Z^m, Z \mid \theta')}{\Pr(Z \mid \theta')},
\tag{8.44}
\]

we can write

\[
\Pr(Z \mid \theta') = \frac{\Pr(T \mid \theta')}{\Pr(Z^m \mid Z, \theta')}.
\tag{8.45}
\]

In terms of log-likelihoods, we have ℓ(θ′; Z) = ℓ0(θ′; T) − ℓ1(θ′; Zm|Z), where ℓ1 is based on the conditional density Pr(Zm|Z, θ′). Taking conditional expectations with respect to the distribution of T|Z governed by parameter θ gives

\[
\begin{aligned}
\ell(\theta'; Z) &= \mathrm{E}[\ell_0(\theta'; T) \mid Z, \theta] - \mathrm{E}[\ell_1(\theta'; Z^m \mid Z) \mid Z, \theta] \\
&\equiv Q(\theta', \theta) - R(\theta', \theta).
\end{aligned}
\tag{8.46}
\]

In the M step, the EM algorithm maximizes Q(θ′, θ) over θ′, rather than the actual objective function ℓ(θ′; Z). Why does it succeed in maximizing ℓ(θ′; Z)? Note that R(θ*, θ) is the expectation of a log-likelihood of a density (indexed by θ*), with respect to the same density indexed by θ, and hence (by Jensen's inequality) is maximized as a function of θ*, when θ* = θ (see Exercise 8.1).
So if θ′ maximizes Q(θ′, θ), we see that

\[
\begin{aligned}
\ell(\theta'; Z) - \ell(\theta; Z) &= [Q(\theta', \theta) - Q(\theta, \theta)] - [R(\theta', \theta) - R(\theta, \theta)] \\
&\geq 0.
\end{aligned}
\tag{8.47}
\]

Hence the EM iteration never decreases the log-likelihood.

This argument also makes it clear that a full maximization in the M step is not necessary: we need only to find a value θ̂(j+1) so that Q(θ′, θ̂(j)) increases as a function of the first argument, that is, Q(θ̂(j+1), θ̂(j)) > Q(θ̂(j), θ̂(j)). Such procedures are called GEM (generalized EM) algorithms. The EM algorithm can also be viewed as a minorization procedure: see Exercise 8.7.

8.5.3 EM as a Maximization–Maximization Procedure

Here is a different view of the EM procedure, as a joint maximization algorithm. Consider the function

\[
F(\theta', \tilde{P}) = \mathrm{E}_{\tilde{P}}[\ell_0(\theta'; T)] - \mathrm{E}_{\tilde{P}}[\log \tilde{P}(Z^m)].
\tag{8.48}
\]

Here P̃(Zm) is any distribution over the latent data Zm.
In the mixture example, P̃(Zm) comprises the set of probabilities γi = Pr(∆i = 1|θ, Z). Note that F evaluated at P̃(Zm) = Pr(Zm|Z, θ′) is the log-likelihood of the observed data, from (8.46)¹.

¹ (8.46) holds for all θ, including θ = θ′.

FIGURE 8.7. Maximization–maximization view of the EM algorithm. Shown are the contours of the (augmented) observed data log-likelihood F(θ′, P̃). The E step is equivalent to maximizing the log-likelihood over the parameters of the latent data distribution. The M step maximizes it over the parameters of the log-likelihood. The red curve corresponds to the observed data log-likelihood, a profile obtained by maximizing F(θ′, P̃) for each value of θ′.
The function F expands the domain of the log-likelihood, to facilitate its maximization.

The EM algorithm can be viewed as a joint maximization method for F over θ′ and P̃(Zm), by fixing one argument and maximizing over the other. The maximizer over P̃(Zm) for fixed θ′ can be shown to be

\[
\tilde{P}(Z^m) = \Pr(Z^m \mid Z, \theta')
\tag{8.49}
\]

(Exercise 8.2).
This is the distribution computed by the E step, for example, (8.42) in the mixture example. In the M step, we maximize F(θ′, P̃) over θ′ with P̃ fixed: this is the same as maximizing the first term EP̃[ℓ0(θ′; T)|Z, θ] since the second term does not involve θ′.

Finally, since F(θ′, P̃) and the observed data log-likelihood agree when P̃(Zm) = Pr(Zm|Z, θ′), maximization of the former accomplishes maximization of the latter. Figure 8.7 shows a schematic view of this process. This view of the EM algorithm leads to alternative maximization procedures. For example, one does not need to maximize with respect to all of the latent data parameters at once, but could instead maximize over one of them at a time, alternating with the M step.

8.6 MCMC for Sampling from the Posterior

Having defined a Bayesian model, one would like to draw samples from the resulting posterior distribution, in order to make inferences about the parameters. Except for simple models, this is often a difficult computational problem. In this section we discuss the Markov chain Monte Carlo (MCMC) approach to posterior sampling.

Algorithm 8.3 Gibbs Sampler.

1. Take some initial values U_k^{(0)}, k = 1, 2, . . . , K.

2. Repeat for t = 1, 2, . . . :

   For k = 1, 2, . . . , K generate U_k^{(t)} from

   \[
   \Pr\bigl(U_k^{(t)} \mid U_1^{(t)}, \ldots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \ldots, U_K^{(t-1)}\bigr).
   \]

3. Continue step 2 until the joint distribution of (U_1^{(t)}, U_2^{(t)}, . . . , U_K^{(t)}) does not change.
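As a toy illustration of Algorithm 8.3 (our example, not one drawn from the text), the sketch below runs a Gibbs sampler for a standard bivariate normal with correlation ρ, for which each full conditional is itself normal: U1 | U2 = u is N(ρu, 1 − ρ²), and symmetrically for U2. A fixed number of sweeps and a discarded burn-in stand in for the informal stopping rule in step 3.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_sweeps=5000, seed=0):
    """Gibbs sampler (Algorithm 8.3) for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    u1, u2 = 0.0, 0.0                       # step 1: initial values U_1^(0), U_2^(0)
    sd = np.sqrt(1.0 - rho ** 2)            # conditional standard deviation
    draws = np.empty((n_sweeps, 2))
    for t in range(n_sweeps):               # step 2: repeat for t = 1, 2, ...
        u1 = rng.normal(rho * u2, sd)       # U_1^(t) from Pr(U_1 | U_2^(t-1))
        u2 = rng.normal(rho * u1, sd)       # U_2^(t) from Pr(U_2 | U_1^(t))
        draws[t] = u1, u2
    return draws

samples = gibbs_bivariate_normal()[1000:]   # discard a burn-in of 1000 sweeps
print(samples.mean(axis=0))                 # should be close to (0, 0)
print(np.corrcoef(samples.T)[0, 1])         # should be close to rho = 0.8
```

After the burn-in, the empirical means and correlation of the draws settle near the target values, which is the informal sense in which the joint distribution in step 3 "does not change".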