In other words, for any given (nondegenerate) point in the space of parameter values there will be a further K! − 1 additional points all of which give rise to exactly the same distribution. This problem is known as identifiability (Casella and Berger, 2002) and is an important issue when we wish to interpret the parameter values discovered by a model. Identifiability will also arise when we discuss models having continuous latent variables in Chapter 12. However, for the purposes of finding a good density model, it is irrelevant because any of the equivalent solutions is as good as any other.

Maximizing the log likelihood function (9.14) for a Gaussian mixture model turns out to be a more complex problem than for the case of a single Gaussian.
The difficulty arises from the presence of the summation over k that appears inside the logarithm in (9.14), so that the logarithm function no longer acts directly on the Gaussian. If we set the derivatives of the log likelihood to zero, we will no longer obtain a closed form solution, as we shall see shortly. One approach is to apply gradient-based optimization techniques (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008). Although gradient-based techniques are feasible, and indeed will play an important role when we discuss mixture density networks in Chapter 5, we now consider an alternative approach known as the EM algorithm, which has broad applicability and which will lay the foundations for a discussion of variational inference techniques in Chapter 10.
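As an illustrative aside (not part of the original development), the quantity being maximized can be evaluated numerically as follows. This is only a sketch: it assumes NumPy and SciPy are available, and the function name and array shapes are choices made purely for the illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """Evaluate ln p(X | pi, mu, Sigma) as in (9.14).

    X     : (N, D) data matrix
    pi    : (K,)   mixing coefficients, summing to one
    mu    : (K, D) component means
    Sigma : (K, D, D) component covariances
    """
    K = len(pi)
    # log(pi_k) + log N(x_n | mu_k, Sigma_k) for every point and component: shape (N, K)
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
         for k in range(K)], axis=1)
    # The sum over k sits inside the logarithm, which is exactly why setting the
    # derivatives to zero does not yield a closed-form solution.
    return logsumexp(log_terms, axis=1).sum()
```

Gradient-based optimization of this quantity is possible, as noted above; the EM algorithm developed in the next subsection provides an alternative that works directly with the responsibilities.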
9.2.2 EM for Gaussian mixtures

An elegant and powerful method for finding maximum likelihood solutions for models with latent variables is called the expectation-maximization algorithm, or EM algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997). Later we shall give a general treatment of EM, and we shall also show how EM can be generalized to obtain the variational inference framework (Section 10.1). Initially, we shall motivate the EM algorithm by giving a relatively informal treatment in the context of the Gaussian mixture model. We emphasize, however, that EM has broad applicability, and indeed it will be encountered in the context of a variety of different models in this book.

Let us begin by writing down the conditions that must be satisfied at a maximum of the likelihood function.
Setting the derivatives of ln p(X|π, µ, Σ) in (9.14) with respect to the means µk of the Gaussian components to zero, we obtain

0 = -\sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \, \Sigma_k^{-1} (x_n - \mu_k)    (9.16)

where we have made use of the form (2.43) for the Gaussian distribution. Note that the posterior probabilities, or responsibilities, given by (9.13) appear naturally on the right-hand side. Multiplying by Σk (which we assume to be nonsingular) and rearranging we obtain

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n    (9.17)

where we have defined

N_k = \sum_{n=1}^{N} \gamma(z_{nk}).    (9.18)
We can interpret Nk as the effective number of points assigned to cluster k. Note carefully the form of this solution. We see that the mean µk for the kth Gaussian component is obtained by taking a weighted mean of all of the points in the data set, in which the weighting factor for data point xn is given by the posterior probability γ(znk) that component k was responsible for generating xn.

If we set the derivative of ln p(X|π, µ, Σ) with respect to Σk to zero, and follow a similar line of reasoning, making use of the result from Section 2.3.4 for the maximum likelihood solution for the covariance matrix of a single Gaussian, we obtain

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{\mathrm{T}}    (9.19)

which has the same form as the corresponding result for a single Gaussian fitted to the data set, but again with each data point weighted by the corresponding posterior probability and with the denominator given by the effective number of points associated with the corresponding component.
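As an illustrative aside (not from the text), the updates (9.17)–(9.19) amount to responsibility-weighted averages and can be written compactly once the responsibilities are available as a matrix. The function name, and the convention that gamma holds γ(znk) as an (N, K) array computed from (9.13), are assumptions of this sketch.

```python
import numpy as np

def m_step_means_covariances(X, gamma):
    """Re-estimate means and covariances from responsibilities, as in (9.17)-(9.19).

    X     : (N, D) data matrix
    gamma : (N, K) responsibilities gamma(z_nk) from (9.13), each row summing to one
    """
    Nk = gamma.sum(axis=0)                       # (9.18): effective number of points per component
    mu = (gamma.T @ X) / Nk[:, None]             # (9.17): responsibility-weighted means, shape (K, D)
    K, D = mu.shape
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]                         # deviations from the newly updated mean
        # (9.19): responsibility-weighted outer products, normalized by the effective count
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return Nk, mu, Sigma
```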
Finally, we maximize ln p(X|π, µ, Σ) with respect to the mixing coefficients πk. Here we must take account of the constraint (9.9), which requires the mixing coefficients to sum to one. This can be achieved using a Lagrange multiplier (Appendix E) and maximizing the following quantity

\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)    (9.20)

which gives

0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda    (9.21)

where again we see the appearance of the responsibilities. If we now multiply both sides by πk and sum over k making use of the constraint (9.9), we find λ = −N. Using this to eliminate λ and rearranging we obtain

\pi_k = \frac{N_k}{N}    (9.22)

so that the mixing coefficient for the kth component is given by the average responsibility which that component takes for explaining the data points.
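Spelling out the elimination of λ described above (this is simply a restatement of the argument in the text): multiplying (9.21) by πk turns the fraction into the responsibility γ(znk), giving

0 = \sum_{n=1}^{N} \gamma(z_{nk}) + \lambda\, \pi_k = N_k + \lambda\, \pi_k .

Summing over k, and using the facts that the Nk sum to N and that the πk sum to one, gives 0 = N + λ, so that λ = −N; substituting back then yields πk = Nk/N, which is (9.22).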
It is worth emphasizing that the results (9.17), (9.19), and (9.22) do not constitute a closed-form solution for the parameters of the mixture model because the responsibilities γ(znk) depend on those parameters in a complex way through (9.13). However, these results do suggest a simple iterative scheme for finding a solution to the maximum likelihood problem, which as we shall see turns out to be an instance of the EM algorithm for the particular case of the Gaussian mixture model. We first choose some initial values for the means, covariances, and mixing coefficients. Then we alternate between the following two updates that we shall call the E step and the M step, for reasons that will become apparent shortly (Section 9.4). In the expectation step, or E step, we use the current values for the parameters to evaluate the posterior probabilities, or responsibilities, given by (9.13). We then use these probabilities in the maximization step, or M step, to re-estimate the means, covariances, and mixing coefficients using the results (9.17), (9.19), and (9.22).
Note that in so doing we first evaluate the new means using (9.17) and then use these new values to find the covariances using (9.19), in keeping with the corresponding result for a single Gaussian distribution. We shall show that each update to the parameters resulting from an E step followed by an M step is guaranteed to increase the log likelihood function. In practice, the algorithm is deemed to have converged when the change in the log likelihood function, or alternatively in the parameters, falls below some threshold.
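The alternation just described can be sketched in code as follows. This is an illustrative implementation rather than the book's own pseudocode; it assumes NumPy and SciPy, and the tolerance, iteration cap, and array shapes are arbitrary choices made for the sketch.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, pi, mu, Sigma, tol=1e-6, max_iter=200):
    """Alternate E and M steps for a Gaussian mixture until the log likelihood stabilizes.

    X : (N, D) data; pi : (K,), mu : (K, D), Sigma : (K, D, D) are initial parameter values.
    """
    N, D = X.shape
    K = len(pi)
    Sigma = np.array(Sigma, dtype=float)          # copy so the initial values are not overwritten
    old_ll = -np.inf
    for _ in range(max_iter):
        # E step: evaluate responsibilities gamma(z_nk) from (9.13), working in log space
        log_terms = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
             for k in range(K)], axis=1)           # shape (N, K)
        log_norm = logsumexp(log_terms, axis=1, keepdims=True)
        gamma = np.exp(log_terms - log_norm)       # rows sum to one

        # M step: re-estimate the parameters using (9.17), (9.19), and (9.22),
        # evaluating the new means first and then the covariances
        Nk = gamma.sum(axis=0)                     # (9.18)
        mu = (gamma.T @ X) / Nk[:, None]           # (9.17)
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # (9.19)
        pi = Nk / N                                # (9.22)

        # Convergence test: stop when the change in the log likelihood (9.14) is small
        ll = log_norm.sum()
        if ll - old_ll < tol:
            break
        old_ll = ll
    return pi, mu, Sigma, ll
```

Initialized as described in the text, a loop of this form produces fits of the kind illustrated in Figure 9.8.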
[Figure 9.8 Illustration of the EM algorithm using the Old Faithful set as used for the illustration of the K-means algorithm in Figure 9.1. Panels (a)–(f) show successive stages of the fit, with L = 1, 2, 5, and 20 indicating the number of complete EM cycles. See the text for details.]

We illustrate the EM algorithm for a mixture of two Gaussians applied to the rescaled Old Faithful data set in Figure 9.8. Here a mixture of two Gaussians is used, with centres initialized using the same values as for the K-means algorithm in Figure 9.1, and with precision matrices initialized to be proportional to the unit matrix. Plot (a) shows the data points in green, together with the initial configuration of the mixture model in which the one standard-deviation contours for the two Gaussian components are shown as blue and red circles. Plot (b) shows the result of the initial E step, in which each data point is depicted using a proportion of blue ink equal to the posterior probability of having been generated from the blue component, and a corresponding proportion of red ink given by the posterior probability of having been generated by the red component.
Thus, points that have a significant probability for belonging to either cluster appear purple. The situation after the first M step is shown in plot (c), in which the mean of the blue Gaussian has moved to the mean of the data set, weighted by the probabilities of each data point belonging to the blue cluster, in other words it has moved to the centre of mass of the blue ink. Similarly, the covariance of the blue Gaussian is set equal to the covariance of the blue ink. Analogous results hold for the red component. Plots (d), (e), and (f) show the results after 2, 5, and 20 complete cycles of EM, respectively.
In plot (f) the algorithm is close to convergence.

Note that the EM algorithm takes many more iterations to reach (approximate) convergence compared with the K-means algorithm, and that each cycle requires significantly more computation. It is therefore common to run the K-means algorithm in order to find a suitable initialization for a Gaussian mixture model that is subsequently adapted using EM. The covariance matrices can conveniently be initialized to the sample covariances of the clusters found by the K-means algorithm, and the mixing coefficients can be set to the fractions of data points assigned to the respective clusters.
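A sketch of this initialization strategy might look as follows. It assumes scikit-learn's KMeans is available (the text does not prescribe any particular implementation), and the small ridge added to each covariance is an extra safeguard not mentioned in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, random_state=0):
    """Initialize mixture parameters from a K-means clustering, as suggested in the text."""
    N, D = X.shape
    km = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(X)
    mu = km.cluster_centers_.copy()               # means: the K-means cluster centres
    pi = np.empty(K)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        Xk = X[km.labels_ == k]
        pi[k] = len(Xk) / N                       # mixing coefficient: fraction of points in cluster k
        # covariance: maximum likelihood (biased) sample covariance of the cluster, plus a
        # small ridge (not from the text) so that a tiny cluster still gives a valid matrix
        Sigma[k] = np.cov(Xk, rowvar=False, bias=True) + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```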
As with gradient-based approaches for maximizing the log likelihood, techniques must be employed to avoid singularities of the likelihood function in which a Gaussian component collapses onto a particular data point. It should be emphasized that there will generally be multiple local maxima of the log likelihood function, and that EM is not guaranteed to find the largest of these maxima. Because the EM algorithm for Gaussian mixtures plays such an important role, we summarize it below.

EM for Gaussian Mixtures

Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (comprising the means and covariances of the components and the mixing coefficients).

1.