The values of $z_k$ therefore satisfy $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$, and we see that there are $K$ possible states for the vector $\mathbf{z}$ according to which element is nonzero. We shall define the joint distribution $p(\mathbf{x}, \mathbf{z})$ in terms of a marginal distribution $p(\mathbf{z})$ and a conditional distribution $p(\mathbf{x}|\mathbf{z})$, corresponding to the graphical model in Figure 9.4. The marginal distribution over $\mathbf{z}$ is specified in terms of the mixing coefficients $\pi_k$, such that
$$p(z_k = 1) = \pi_k$$
where the parameters $\{\pi_k\}$ must satisfy
$$0 \leqslant \pi_k \leqslant 1 \tag{9.8}$$
together with
$$\sum_{k=1}^{K} \pi_k = 1 \tag{9.9}$$
in order to be valid probabilities.

Figure 9.4: Graphical representation of a mixture model, in which the joint distribution is expressed in the form $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})p(\mathbf{x}|\mathbf{z})$.
Because $\mathbf{z}$ uses a 1-of-$K$ representation, we can also write this distribution in the form
$$p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}. \tag{9.10}$$
Similarly, the conditional distribution of $\mathbf{x}$ given a particular value for $\mathbf{z}$ is a Gaussian
$$p(\mathbf{x}|z_k = 1) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
which can also be written in the form
$$p(\mathbf{x}|\mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}. \tag{9.11}$$
(Exercise 9.3) The joint distribution is given by $p(\mathbf{z})p(\mathbf{x}|\mathbf{z})$, and the marginal distribution of $\mathbf{x}$ is then obtained by summing the joint distribution over all possible states of $\mathbf{z}$ to give
$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})p(\mathbf{x}|\mathbf{z}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{9.12}$$
where we have made use of (9.10) and (9.11). Thus the marginal distribution of $\mathbf{x}$ is a Gaussian mixture of the form (9.7).
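To make (9.12) concrete, here is a minimal NumPy/SciPy sketch that evaluates a Gaussian mixture density; the function name and the toy parameter values are illustrative choices, not taken from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_mixture_pdf(x, pis, mus, Sigmas):
    """Evaluate the mixture density (9.12): p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# A small, arbitrary two-component mixture in two dimensions.
pis = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gaussian_mixture_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
```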
If we have several observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$, then, because we have represented the marginal distribution in the form $p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z})$, it follows that for every observed data point $\mathbf{x}_n$ there is a corresponding latent variable $\mathbf{z}_n$.

We have therefore found an equivalent formulation of the Gaussian mixture involving an explicit latent variable. It might seem that we have not gained much by doing so. However, we are now able to work with the joint distribution $p(\mathbf{x}, \mathbf{z})$ instead of the marginal distribution $p(\mathbf{x})$, and this will lead to significant simplifications, most notably through the introduction of the expectation-maximization (EM) algorithm.

Another quantity that will play an important role is the conditional probability of $\mathbf{z}$ given $\mathbf{x}$. We shall use $\gamma(z_k)$ to denote $p(z_k = 1|\mathbf{x})$, whose value can be found using Bayes' theorem
$$\gamma(z_k) \equiv p(z_k = 1|\mathbf{x}) = \frac{p(z_k = 1)p(\mathbf{x}|z_k = 1)}{\displaystyle\sum_{j=1}^{K} p(z_j = 1)p(\mathbf{x}|z_j = 1)} = \frac{\pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\displaystyle\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \tag{9.13}$$
(Section 8.1.2) We shall view $\pi_k$ as the prior probability of $z_k = 1$, and the quantity $\gamma(z_k)$ as the corresponding posterior probability once we have observed $\mathbf{x}$.
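A minimal sketch of how the posterior probabilities (9.13) might be computed for a whole data set, working in log space for numerical stability; the function name and the array layout are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """Posterior p(z_k = 1 | x_n) of (9.13) for every row of X, computed in log space."""
    # log [ pi_k N(x_n | mu_k, Sigma_k) ] for every point n and component k
    log_weighted = np.stack(
        [np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
         for k in range(len(pis))], axis=1)                      # shape (N, K)
    # Normalise over components, i.e. divide by the denominator of (9.13).
    return np.exp(log_weighted - logsumexp(log_weighted, axis=1, keepdims=True))
```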
As we shall see later, $\gamma(z_k)$ can also be viewed as the responsibility that component $k$ takes for 'explaining' the observation $\mathbf{x}$.

We can use the technique of ancestral sampling to generate random samples distributed according to the Gaussian mixture model. To do this, we first generate a value for $\mathbf{z}$, which we denote $\widehat{\mathbf{z}}$, from the marginal distribution $p(\mathbf{z})$ and then generate a value for $\mathbf{x}$ from the conditional distribution $p(\mathbf{x}|\widehat{\mathbf{z}})$. Techniques for sampling from standard distributions are discussed in Chapter 11. We can depict samples from the joint distribution $p(\mathbf{x}, \mathbf{z})$ by plotting points at the corresponding values of $\mathbf{x}$ and then colouring them according to the value of $\mathbf{z}$, in other words according to which Gaussian component was responsible for generating them, as shown in Figure 9.5(a). Similarly, samples from the marginal distribution $p(\mathbf{x})$ are obtained by taking the samples from the joint distribution and ignoring the values of $\mathbf{z}$.
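One possible implementation of this two-stage ancestral sampling procedure in NumPy; the function name and parameter conventions are illustrative.

```python
import numpy as np

def sample_mixture(n_samples, pis, mus, Sigmas, seed=0):
    """Ancestral sampling: draw z from p(z), then x from p(x|z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n_samples, p=pis)   # latent component index for each sample
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z   # keeping z as well gives the 'complete' data set of Figure 9.5(a)
```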
These are illustrated in Figure 9.5(b) by plotting the $\mathbf{x}$ values without any coloured labels. We can also use this synthetic data set to illustrate the 'responsibilities' by evaluating, for every data point, the posterior probability for each component in the mixture distribution from which this data set was generated. In particular, we can represent the value of the responsibilities $\gamma(z_{nk})$ associated with data point $\mathbf{x}_n$ by plotting the corresponding point using proportions of red, blue, and green ink given by $\gamma(z_{nk})$ for $k = 1, 2, 3$, respectively, as shown in Figure 9.5(c).
So, for instance, a data point for which $\gamma(z_{n1}) = 1$ will be coloured red, whereas one for which $\gamma(z_{n2}) = \gamma(z_{n3}) = 0.5$ will be coloured with equal proportions of blue and green ink and so will appear cyan. This should be compared with Figure 9.5(a) in which the data points were labelled using the true identity of the component from which they were generated.
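One way to produce a plot like Figure 9.5(c) is to pass the responsibilities directly as per-point RGB colours. This sketch assumes 2-D data, three components, and the hypothetical responsibilities() helper sketched above; the red/blue/green assignment for k = 1, 2, 3 follows the text.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_responsibilities(X, gamma):
    """Scatter 2-D points coloured by responsibilities (red, blue, green ink for k = 1, 2, 3)."""
    # Matplotlib expects (red, green, blue) columns, so reorder the components as k = 1, 3, 2.
    rgb = np.stack([gamma[:, 0], gamma[:, 2], gamma[:, 1]], axis=1)
    plt.scatter(X[:, 0], X[:, 1], c=rgb, s=10)
    plt.show()
```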
Figure 9.5: Example of 500 points drawn from the mixture of 3 Gaussians shown in Figure 2.23. (a) Samples from the joint distribution $p(\mathbf{z})p(\mathbf{x}|\mathbf{z})$ in which the three states of $\mathbf{z}$, corresponding to the three components of the mixture, are depicted in red, green, and blue, and (b) the corresponding samples from the marginal distribution $p(\mathbf{x})$, which is obtained by simply ignoring the values of $\mathbf{z}$ and just plotting the $\mathbf{x}$ values. The data set in (a) is said to be complete, whereas that in (b) is incomplete. (c) The same samples in which the colours represent the value of the responsibilities $\gamma(z_{nk})$ associated with data point $\mathbf{x}_n$, obtained by plotting the corresponding point using proportions of red, blue, and green ink given by $\gamma(z_{nk})$ for $k = 1, 2, 3$, respectively.

9.2.1 Maximum likelihood

Suppose we have a data set of observations $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, and we wish to model this data using a mixture of Gaussians. We can represent this data set as an $N \times D$ matrix $\mathbf{X}$ in which the $n$th row is given by $\mathbf{x}_n^{\mathrm{T}}$. Similarly, the corresponding latent variables will be denoted by an $N \times K$ matrix $\mathbf{Z}$ with rows $\mathbf{z}_n^{\mathrm{T}}$. If we assume that the data points are drawn independently from the distribution, then we can express the Gaussian mixture model for this i.i.d. data set using the graphical representation shown in Figure 9.6. From (9.7) the log of the likelihood function is given by
$$\ln p(\mathbf{X}|\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}. \tag{9.14}$$
Before discussing how to maximize this function, it is worth emphasizing that there is a significant problem associated with the maximum likelihood framework applied to Gaussian mixture models, due to the presence of singularities.
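The log likelihood (9.14) could be evaluated along the following lines; computing each inner sum with log-sum-exp avoids underflow when the individual component densities are very small. Names and array layout are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Log likelihood (9.14): sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }."""
    log_weighted = np.stack(
        [np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
         for k in range(len(pis))], axis=1)          # shape (N, K)
    return logsumexp(log_weighted, axis=1).sum()     # log-sum-exp over k, then sum over n
```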
Figure 9.6: Graphical representation of a Gaussian mixture model for a set of $N$ i.i.d. data points $\{\mathbf{x}_n\}$, with corresponding latent points $\{\mathbf{z}_n\}$, where $n = 1, \ldots, N$.

Figure 9.7: Illustration of how singularities in the likelihood function arise with mixtures of Gaussians. This should be compared with the case of a single Gaussian shown in Figure 1.14 for which no singularities arise.

For simplicity, consider a Gaussian mixture whose components have covariance matrices given by $\boldsymbol{\Sigma}_k = \sigma_k^2\mathbf{I}$, where $\mathbf{I}$ is the unit matrix, although the conclusions will hold for general covariance matrices. Suppose that one of the components of the mixture model, let us say the $j$th component, has its mean $\boldsymbol{\mu}_j$ exactly equal to one of the data points so that $\boldsymbol{\mu}_j = \mathbf{x}_n$ for some value of $n$. This data point will then contribute a term in the likelihood function of the form
$$\mathcal{N}(\mathbf{x}_n|\mathbf{x}_n, \sigma_j^2\mathbf{I}) = \frac{1}{(2\pi)^{1/2}} \frac{1}{\sigma_j}. \tag{9.15}$$
(Section 10.1) If we consider the limit $\sigma_j \to 0$, then we see that this term goes to infinity and so the log likelihood function will also go to infinity.
Thus the maximization ofthe log likelihood function is not a well posed problem because such singularitieswill always be present and will occur whenever one of the Gaussian components‘collapses’ onto a specific data point. Recall that this problem did not arise in thecase of a single Gaussian distribution. To understand the difference, note that if asingle Gaussian collapses onto a data point it will contribute multiplicative factorsto the likelihood function arising from the other data points and these factors will goto zero exponentially fast, giving an overall likelihood that goes to zero rather thaninfinity.
However, once we have (at least) two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points while the other component can shrink onto one specific data point and thereby contribute an ever increasing additive value to the log likelihood. This is illustrated in Figure 9.7. These singularities provide another example of the severe over-fitting that can occur in a maximum likelihood approach. We shall see that this difficulty does not occur if we adopt a Bayesian approach. For the moment, however, we simply note that in applying maximum likelihood to Gaussian mixture models we must take steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved. We can hope to avoid the singularities by using suitable heuristics, for instance by detecting when a Gaussian component is collapsing and resetting its mean to a randomly chosen value while also resetting its covariance to some large value, and then continuing with the optimization (one possible form of such a reset is sketched below).

A further issue in finding maximum likelihood solutions arises from the fact that for any given maximum likelihood solution, a $K$-component mixture will have a total of $K!$ equivalent solutions corresponding to the $K!$ ways of assigning $K$ sets of parameters to $K$ components.
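Returning to the reset heuristic mentioned above, one hypothetical implementation might look like the following; the collapse test, the thresholds, and the re-initialisation scheme are all assumptions for illustration rather than anything prescribed by the text.

```python
import numpy as np

def reset_collapsing_components(mus, Sigmas, X, min_var=1e-6, reset_scale=1.0, seed=None):
    """Heuristic reset of mixture components whose covariance has (nearly) collapsed.

    Collapse is detected by a very small smallest eigenvalue of the covariance; the
    mean is then re-initialised at random and the covariance reset to a large value.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    for k in range(len(mus)):
        if np.linalg.eigvalsh(Sigmas[k]).min() < min_var:     # component has collapsed
            mus[k] = X[rng.integers(len(X))] + rng.normal(scale=0.1, size=D)  # random new mean
            Sigmas[k] = reset_scale * np.eye(D)               # reset to a large covariance
    return mus, Sigmas
```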