Bishop C.M. Pattern Recognition and Machine Learning (2006)
Calculate the number of distinct directed trees that can be constructed from a given undirected tree.

8.19 Apply the sum-product algorithm derived in Section 8.4.4 to the chain-of-nodes model discussed in Section 8.4.1 and show that the results (8.54), (8.55), and (8.57) are recovered as a special case.

8.20 www Consider the message passing protocol for the sum-product algorithm on a tree-structured factor graph in which messages are first propagated from the leaves to an arbitrarily chosen root node and then from the root node out to the leaves. Use proof by induction to show that the messages can be passed in such an order that, at every step, each node that must send a message has received all of the incoming messages necessary to construct its outgoing messages.

8.21 www Show that the marginal distributions p(x_s) over the sets of variables x_s associated with each of the factors f_s(x_s) in a factor graph can be found by first running the sum-product message passing algorithm and then evaluating the required marginals using (8.72).

8.22 Consider a tree-structured factor graph in which a given subset of the variable nodes forms a connected subgraph (i.e., any variable node of the subset is connected to at least one of the other variable nodes via a single factor node). Show how the sum-product algorithm can be used to compute the marginal distribution over that subset.

8.23 www In Section 8.4.4, we showed that the marginal distribution p(x_i) for a variable node x_i in a factor graph is given by the product of the messages arriving at this node from neighbouring factor nodes in the form (8.63). Show that the marginal p(x_i) can also be written as the product of the incoming message along any one of the links with the outgoing message along the same link.

8.24 Show that the marginal distribution for the variables x_s in a factor f_s(x_s) in a tree-structured factor graph, after running the sum-product message passing algorithm, can be written as the product of the messages arriving at the factor node along all its links, times the local factor f_s(x_s), in the form (8.72).

8.25 In (8.86), we verified that the sum-product algorithm run on the graph in Figure 8.51, with node x_3 designated as the root node, gives the correct marginal for x_2. Show that the correct marginals are obtained also for x_1 and x_3. Similarly, show that the use of the result (8.72) after running the sum-product algorithm on this graph gives the correct joint distribution for x_1, x_2.

8.26 Consider a tree-structured factor graph over discrete variables, and suppose we wish to evaluate the joint distribution p(x_a, x_b) associated with two variables x_a and x_b that do not belong to a common factor. Define a procedure for using the sum-product algorithm to evaluate this joint distribution in which one of the variables is successively clamped to each of its allowed values.

8.27 Consider two discrete variables x and y, each having three possible states, for example x, y ∈ {0, 1, 2}. Construct a joint distribution p(x, y) over these variables having the property that the value x̂ that maximizes the marginal p(x), together with the value ŷ that maximizes the marginal p(y), have probability zero under the joint distribution, so that p(x̂, ŷ) = 0.

8.28 www The concept of a pending message in the sum-product algorithm for a factor graph was defined in Section 8.4.7. Show that if the graph has one or more cycles, there will always be at least one pending message, irrespective of how long the algorithm runs.

8.29 www Show that if the sum-product algorithm is run on a factor graph with a tree structure (no loops), then after a finite number of messages have been sent, there will be no pending messages.

9. Mixture Models and EM

If we define a joint distribution over observed and latent variables, the corresponding distribution of the observed variables alone is obtained by marginalization.
This allows relatively complex marginal distributions over observed variables to be expressed in terms of more tractable joint distributions over the expanded space of observed and latent variables. The introduction of latent variables thereby allows complicated distributions to be formed from simpler components. In this chapter, we shall see that mixture distributions, such as the Gaussian mixture discussed in Section 2.3.9, can be interpreted in terms of discrete latent variables. Continuous latent variables will form the subject of Chapter 12.

As well as providing a framework for building more complex probability distributions, mixture models can also be used to cluster data. We therefore begin our discussion of mixture distributions by considering the problem of finding clusters in a set of data points, which we approach first using a nonprobabilistic technique called the K-means algorithm (Lloyd, 1982). Then we introduce the latent variable view of mixture distributions, in which the discrete latent variables can be interpreted as defining assignments of data points to specific components of the mixture. A general technique for finding maximum likelihood estimators in latent variable models is the expectation-maximization (EM) algorithm. We first use the Gaussian mixture distribution to motivate the EM algorithm in a fairly informal way, and then we give a more careful treatment based on the latent variable viewpoint. We shall see that the K-means algorithm corresponds to a particular nonprobabilistic limit of EM applied to mixtures of Gaussians. Finally, we discuss EM in some generality.

Gaussian mixture models are widely used in data mining, pattern recognition, machine learning, and statistical analysis.
In many applications, their parameters are determined by maximum likelihood, typically using the EM algorithm. However, as we shall see, there are some significant limitations to the maximum likelihood approach, and in Chapter 10 we shall show that an elegant Bayesian treatment can be given using the framework of variational inference. This requires little additional computation compared with EM, and it resolves the principal difficulties of maximum likelihood while also allowing the number of components in the mixture to be inferred automatically from the data.

9.1. K-means Clustering

We begin by considering the problem of identifying groups, or clusters, of data points in a multidimensional space. Suppose we have a data set {x_1, ..., x_N} consisting of N observations of a random D-dimensional Euclidean variable x. Our goal is to partition the data set into some number K of clusters, where we shall suppose for the moment that the value of K is given. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. We can formalize this notion by first introducing a set of D-dimensional vectors µ_k, where k = 1, ..., K, in which µ_k is a prototype associated with the kth cluster. As we shall see shortly, we can think of the µ_k as representing the centres of the clusters. Our goal is then to find an assignment of data points to clusters, as well as a set of vectors {µ_k}, such that the sum of the squares of the distances of each data point to its closest vector µ_k is a minimum.

It is convenient at this point to define some notation to describe the assignment of data points to clusters. For each data point x_n, we introduce a corresponding set of binary indicator variables r_nk ∈ {0, 1}, where k = 1, ..., K, describing which of the K clusters the data point x_n is assigned to, so that if data point x_n is assigned to cluster k then r_nk = 1, and r_nj = 0 for j ≠ k. This is known as the 1-of-K coding scheme. We can then define an objective function, sometimes called a distortion measure, given by

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2                    (9.1)

which represents the sum of the squares of the distances of each data point to its assigned vector µ_k. Our goal is to find values for the {r_nk} and the {µ_k} so as to minimize J.
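As a small illustration of the notation, the sketch below (not from the text; it assumes NumPy and uses hypothetical names X, R, and mu for the data matrix, the 1-of-K indicator matrix, and the prototypes) evaluates the distortion measure (9.1) for a given assignment.

```python
import numpy as np

def distortion(X, R, mu):
    """Distortion measure J of (9.1).

    X  : (N, D) array of data points x_n
    R  : (N, K) array of 1-of-K indicators r_nk
    mu : (K, D) array of cluster prototypes mu_k
    """
    # Squared Euclidean distance from every point to every prototype, shape (N, K).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # Sum only the distances selected by the indicator variables.
    return float((R * sq_dists).sum())
```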
We can do this through an iterative procedure in which each iteration involves two successive steps corresponding to successive optimizations with respect to the r_nk and the µ_k. First we choose some initial values for the µ_k. Then in the first phase we minimize J with respect to the r_nk, keeping the µ_k fixed. In the second phase we minimize J with respect to the µ_k, keeping the r_nk fixed. This two-stage optimization is then repeated until convergence. We shall see that these two stages of updating r_nk and updating µ_k correspond respectively to the E (expectation) and M (maximization) steps of the EM algorithm, and to emphasize this we shall use the terms E step and M step in the context of the K-means algorithm.

Consider first the determination of the r_nk. Because J in (9.1) is a linear function of r_nk, this optimization can be performed easily to give a closed-form solution. The terms involving different n are independent, and so we can optimize for each n separately by choosing r_nk to be 1 for whichever value of k gives the minimum value of ||x_n − µ_k||^2. In other words, we simply assign the nth data point to the closest cluster centre. More formally, this can be expressed as

    r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise.} \end{cases}        (9.2)

Now consider the optimization of the µ_k with the r_nk held fixed. The objective function J is a quadratic function of µ_k, and it can be minimized by setting its derivative with respect to µ_k to zero, giving

    2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0                    (9.3)

which we can easily solve for µ_k to give

    \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}.                    (9.4)

The denominator in this expression is equal to the number of points assigned to cluster k, and so this result has a simple interpretation: set µ_k equal to the mean of all of the data points x_n assigned to cluster k. For this reason, the procedure is known as the K-means algorithm.

The two phases of re-assigning data points to clusters and re-computing the cluster means are repeated in turn until there is no further change in the assignments (or until some maximum number of iterations is exceeded).
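The updates (9.2) and (9.4) translate directly into code. The following is a minimal sketch of this procedure, not an implementation from the book: the function name, the random initialization from K distinct data points, and the handling of empty clusters (keeping the previous mean) are choices made here for illustration, assuming NumPy.

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Minimal K-means sketch following (9.2)-(9.4).

    Returns the prototypes mu (K, D) and hard assignments (N,).
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    # Initialize the mu_k by picking K distinct data points at random.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    assign = np.full(N, -1)

    for _ in range(max_iter):
        # E step (9.2): assign each point to its nearest prototype.
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_assign = sq_dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # no assignment changed, so J can decrease no further
        assign = new_assign

        # M step (9.4): set each mu_k to the mean of its assigned points.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:          # keep old mu_k if the cluster is empty
                mu[k] = members.mean(axis=0)
    return mu, assign
```

Each pass performs one E step and one M step, and the loop stops when the assignments no longer change.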
Because each phase reduces the value of the objective function J, convergence of the algorithm is assured. However, it may converge to a local rather than global minimum of J. The convergence properties of the K-means algorithm were studied by MacQueen (1967).

The K-means algorithm is illustrated using the Old Faithful data set in Figure 9.1. For the purposes of this example, we have made a linear re-scaling of the data, known as standardizing, such that each of the variables has zero mean and unit standard deviation. For this example, we have chosen K = 2.

Figure 9.1 Illustration of the K-means algorithm using the re-scaled Old Faithful data set. (a) Green points denote the data set in a two-dimensional Euclidean space.
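A rough way to reproduce the flavour of this experiment, using synthetic two-dimensional data as a stand-in for the Old Faithful measurements (which are not included here) and the hypothetical kmeans function sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs standing in for the Old Faithful data.
X = np.vstack([
    rng.normal(loc=(-2.0, -2.0), scale=0.6, size=(150, 2)),
    rng.normal(loc=(+2.0, +2.0), scale=0.6, size=(150, 2)),
])
# Standardize: zero mean and unit standard deviation for each variable.
X = (X - X.mean(axis=0)) / X.std(axis=0)

mu, assign = kmeans(X, K=2, rng=0)
print("cluster means:\n", mu)
print("points per cluster:", np.bincount(assign, minlength=2))
```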