Bishop, C.M. Pattern Recognition and Machine Learning (2006)
However, there is nothing specific to discrete variables either in the graphical framework or in the probabilistic construction of the sum-product algorithm. For continuous variables the summations are simply replaced by integrations.

Table 8.1  Example of a joint distribution over two binary variables for which the maximum of the joint distribution occurs for different variable values compared to the maxima of the two marginals.

          y = 0    y = 1
  x = 0    0.3      0.3
  x = 1    0.4      0.0
We shall give an example of the sum-product algorithm applied to a graph of linear-Gaussian variables when we consider linear dynamical systems (Section 13.3).

8.4.5 The max-sum algorithm

The sum-product algorithm allows us to take a joint distribution $p(\mathbf{x})$ expressed as a factor graph and efficiently find marginals over the component variables. Two other common tasks are to find a setting of the variables that has the largest probability and to find the value of that probability. These can be addressed through a closely related algorithm called max-sum, which can be viewed as an application of dynamic programming in the context of graphical models (Cormen et al., 2001).

A simple approach to finding latent variable values having high probability would be to run the sum-product algorithm to obtain the marginals $p(x_i)$ for every variable, and then, for each marginal in turn, to find the value of $x_i$ that maximizes that marginal.
However, this would give the set of values that are individually the most probable. In practice, we typically wish to find the set of values that jointly have the largest probability, in other words the vector $\mathbf{x}^{\max}$ that maximizes the joint distribution, so that

$$\mathbf{x}^{\max} = \arg\max_{\mathbf{x}} p(\mathbf{x}) \tag{8.87}$$

for which the corresponding value of the joint probability will be given by

$$p(\mathbf{x}^{\max}) = \max_{\mathbf{x}} p(\mathbf{x}). \tag{8.88}$$

In general, $\mathbf{x}^{\max}$ is not the same as the set of individually most probable values, as we can easily show using a simple example (Exercise 8.27). Consider the joint distribution $p(x, y)$ over two binary variables $x, y \in \{0, 1\}$ given in Table 8.1.
The joint distribution is maximized by setting $x = 1$ and $y = 0$, corresponding to the value 0.4. However, the marginal for $p(x)$, obtained by summing over both values of $y$, is given by $p(x = 0) = 0.6$ and $p(x = 1) = 0.4$, and similarly the marginal for $y$ is given by $p(y = 0) = 0.7$ and $p(y = 1) = 0.3$, and so the marginals are maximized by $x = 0$ and $y = 0$, which corresponds to a value of 0.3 for the joint distribution. In fact, it is not difficult to construct examples for which the set of individually most probable values has probability zero under the joint distribution.
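These numbers are easy to verify directly. The following minimal sketch (the array layout and variable names are our own, not from the text) checks both maximizations for Table 8.1:

```python
import numpy as np

# Joint distribution from Table 8.1: rows index x, columns index y.
p = np.array([[0.3, 0.3],
              [0.4, 0.0]])

# Jointly most probable configuration: (x, y) = (1, 0), with p = 0.4.
x_j, y_j = np.unravel_index(np.argmax(p), p.shape)
print(x_j, y_j, p[x_j, y_j])                      # 1 0 0.4

# Individually most probable values from the two marginals.
p_x = p.sum(axis=1)                               # [0.6, 0.4] -> x = 0
p_y = p.sum(axis=0)                               # [0.7, 0.3] -> y = 0
print(p_x.argmax(), p_y.argmax(),
      p[p_x.argmax(), p_y.argmax()])              # 0 0 0.3
```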
We therefore seek an efficient algorithm for finding the value of $\mathbf{x}$ that maximizes the joint distribution $p(\mathbf{x})$ and that will allow us to obtain the value of the joint distribution at its maximum. To address the second of these problems, we shall simply write out the max operator in terms of its components

$$\max_{\mathbf{x}} p(\mathbf{x}) = \max_{x_1} \cdots \max_{x_M} p(\mathbf{x}) \tag{8.89}$$

where $M$ is the total number of variables, and then substitute for $p(\mathbf{x})$ using its expansion in terms of a product of factors. In deriving the sum-product algorithm, we made use of the distributive law (8.53) for multiplication. Here we make use of the analogous law for the max operator

$$\max(ab, ac) = a \max(b, c) \tag{8.90}$$

which holds if $a \geqslant 0$ (as will always be the case for the factors in a graphical model). This allows us to exchange products with maximizations.
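The identity (8.90) is also easy to confirm numerically; a one-line check with arbitrary nonnegative values:

```python
a, b, c = 0.5, 0.2, 0.8
assert max(a * b, a * c) == a * max(b, c)  # (8.90); requires a >= 0
```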
Consider first the simple example of a chain of nodes described by (8.49). The evaluation of the probability maximum can be written as

$$\begin{aligned}
\max_{\mathbf{x}} p(\mathbf{x}) &= \frac{1}{Z} \max_{x_1} \cdots \max_{x_N} \big[ \psi_{1,2}(x_1, x_2) \cdots \psi_{N-1,N}(x_{N-1}, x_N) \big] \\
&= \frac{1}{Z} \max_{x_1} \Big[ \psi_{1,2}(x_1, x_2) \Big[ \cdots \max_{x_N} \psi_{N-1,N}(x_{N-1}, x_N) \Big] \Big].
\end{aligned}$$

As with the calculation of marginals, we see that exchanging the max and product operators results in a much more efficient computation, and one that is easily interpreted in terms of messages passed from node $x_N$ backwards along the chain to node $x_1$.
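To make the efficiency gain concrete, here is a small sketch that passes max-messages backwards along such a chain and checks the result against brute-force enumeration. The random positive potentials and all names are illustrative, and the constant factor $1/Z$ is omitted since it does not affect the comparison:

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3  # chain of N variables, each taking K states
# Hypothetical positive pairwise potentials: psi[n][i, j] = psi_{n+1,n+2}(i, j).
psi = [rng.uniform(0.1, 1.0, size=(K, K)) for _ in range(N - 1)]

# Pass max-messages from x_N backwards along the chain to x_1.
m = np.ones(K)
for table in reversed(psi):
    m = (table * m[None, :]).max(axis=1)  # max over the next variable's state
max_by_messages = m.max()  # unnormalized maximum of the joint

# Brute-force check over all K**N configurations.
def joint(cfg):
    return np.prod([psi[n][cfg[n], cfg[n + 1]] for n in range(N - 1)])

max_by_enumeration = max(joint(c) for c in product(range(K), repeat=N))
assert np.isclose(max_by_messages, max_by_enumeration)
```

The message-passing loop costs $O(NK^2)$ operations, whereas the enumeration costs $O(K^N)$.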
We can readily generalize this result to arbitrary tree-structured factor graphs by substituting the expression (8.59) for the factor graph expansion into (8.89) and again exchanging maximizations with products. The structure of this calculation is identical to that of the sum-product algorithm, and so we can simply translate those results into the present context. In particular, suppose that we designate a particular variable node as the 'root' of the graph. Then we start a set of messages propagating inwards from the leaves of the tree towards the root, with each node sending its message towards the root once it has received all incoming messages from its other neighbours. The final maximization is performed over the product of all messages arriving at the root node, and gives the maximum value for $p(\mathbf{x})$. This could be called the max-product algorithm and is identical to the sum-product algorithm except that summations are replaced by maximizations.
Note that at this stage, messages have been sent from leaves to the root, but not in the other direction. In practice, products of many small probabilities can lead to numerical underflow problems, and so it is convenient to work with the logarithm of the joint distribution. The logarithm is a monotonic function, so that if $a > b$ then $\ln a > \ln b$, and hence the max operator and the logarithm function can be interchanged, so that

$$\ln\left(\max_{\mathbf{x}} p(\mathbf{x})\right) = \max_{\mathbf{x}} \ln p(\mathbf{x}). \tag{8.91}$$

The distributive property is preserved because

$$\max(a + b, a + c) = a + \max(b, c). \tag{8.92}$$

Thus taking the logarithm simply has the effect of replacing the products in the max-product algorithm with sums, and so we obtain the max-sum algorithm.
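A two-line illustration of why the log domain matters: with 2000 factors of 0.1 the direct product underflows double precision, while the sum of logs is perfectly well behaved.

```python
import numpy as np

p = np.full(2000, 0.1)       # 2000 factors, each equal to 0.1
print(np.prod(p))            # 0.0 -- the product underflows double precision
print(np.log(p).sum())       # about -4605.17, easily representable
```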
From the results (8.66) and (8.69) derived earlier for the sum-product algorithm, we can readily write down the max-sum algorithm in terms of message passing simply by replacing 'sum' with 'max' and replacing products with sums of logarithms to give

$$\mu_{f \to x}(x) = \max_{x_1, \ldots, x_M} \left[ \ln f(x, x_1, \ldots, x_M) + \sum_{m \in \mathrm{ne}(f) \setminus x} \mu_{x_m \to f}(x_m) \right] \tag{8.93}$$

$$\mu_{x \to f}(x) = \sum_{l \in \mathrm{ne}(x) \setminus f} \mu_{f_l \to x}(x). \tag{8.94}$$

The initial messages sent by the leaf nodes are obtained by analogy with (8.70) and (8.71) and are given by

$$\mu_{x \to f}(x) = 0 \tag{8.95}$$

$$\mu_{f \to x}(x) = \ln f(x) \tag{8.96}$$

while at the root node the maximum probability can then be computed, by analogy with (8.63), using

$$p^{\max} = \max_{x} \left[ \sum_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x) \right]. \tag{8.97}$$
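Specialized to the chain of the earlier sketch, the recursions (8.93)-(8.97) in the log domain collapse to the following few lines (again with illustrative random potentials; the additive constant $-\ln Z$ is ignored):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 4
# Assumed setup: log-potentials ln f_{n,n+1}(x_n, x_{n+1}) for a simple chain.
log_f = [np.log(rng.uniform(0.1, 1.0, size=(K, K))) for _ in range(N - 1)]

msg = np.zeros(K)  # leaf message mu_{x_1 -> f_{1,2}} = 0, as in (8.95)
for lf in log_f:
    # (8.93) on a chain: maximize over the previous variable's state;
    # (8.94) then simply forwards the result to the next factor.
    msg = (lf + msg[:, None]).max(axis=0)
log_p_max = msg.max()  # (8.97) evaluated at the root x_N
```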
So far, we have seen how to find the maximum of the joint distribution by propagating messages from the leaves to an arbitrarily chosen root node. The result will be the same irrespective of which node is chosen as the root. Now we turn to the second problem of finding the configuration of the variables for which the joint distribution attains this maximum value. So far, we have sent messages from the leaves to the root. The process of evaluating (8.97) will also give the value $x^{\max}$ for the most probable value of the root node variable, defined by

$$x^{\max} = \arg\max_{x} \left[ \sum_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x) \right]. \tag{8.98}$$

At this point, we might be tempted simply to continue with the message passing algorithm and send messages from the root back out to the leaves, using (8.93) and (8.94), then apply (8.98) to all of the remaining variable nodes. However, because we are now maximizing rather than summing, it is possible that there may be multiple configurations of $\mathbf{x}$ all of which give rise to the maximum value for $p(\mathbf{x})$.
In such cases, this strategy can fail because it is possible for the individual variable values obtained by maximizing the product of messages at each node to belong to different maximizing configurations, giving an overall configuration that no longer corresponds to a maximum.

The problem can be resolved by adopting a rather different kind of message passing from the root node to the leaves. To see how this works, let us return once again to the simple chain example of $N$ variables $x_1, \ldots, x_N$, each having $K$ states, corresponding to the graph shown in Figure 8.38.
Figure 8.53  A lattice, or trellis, diagram showing explicitly the $K$ possible states (one per row of the diagram) for each of the variables $x_n$ in the chain model. In this illustration $K = 3$. The arrow shows the direction of message passing in the max-product algorithm. For every state $k$ of each variable $x_n$ (corresponding to column $n$ of the diagram) the function $\phi(x_n)$ defines a unique state at the previous variable, indicated by the black lines. The two paths through the lattice correspond to configurations that give the global maximum of the joint probability distribution, and either of these can be found by tracing back along the black lines in the opposite direction to the arrow.
Suppose we take node $x_N$ to be the root node. Then in the first phase, we propagate messages from the leaf node $x_1$ to the root node using

$$\mu_{x_n \to f_{n,n+1}}(x_n) = \mu_{f_{n-1,n} \to x_n}(x_n)$$

$$\mu_{f_{n-1,n} \to x_n}(x_n) = \max_{x_{n-1}} \left[ \ln f_{n-1,n}(x_{n-1}, x_n) + \mu_{x_{n-1} \to f_{n-1,n}}(x_{n-1}) \right]$$

which follow from applying (8.94) and (8.93) to this particular graph.
The initial message sent from the leaf node is simply

$$\mu_{x_1 \to f_{1,2}}(x_1) = 0. \tag{8.99}$$

The most probable value for $x_N$ is then given by

$$x_N^{\max} = \arg\max_{x_N} \left[ \mu_{f_{N-1,N} \to x_N}(x_N) \right]. \tag{8.100}$$

Now we need to determine the states of the previous variables that correspond to the same maximizing configuration. This can be done by keeping track of which values of the variables gave rise to the maximum state of each variable, in other words by storing quantities given by

$$\phi(x_n) = \arg\max_{x_{n-1}} \left[ \ln f_{n-1,n}(x_{n-1}, x_n) + \mu_{x_{n-1} \to f_{n-1,n}}(x_{n-1}) \right]. \tag{8.101}$$

To understand better what is happening, it is helpful to represent the chain of variables in terms of a lattice or trellis diagram as shown in Figure 8.53. Note that this is not a probabilistic graphical model because the nodes represent individual states of variables, while each variable corresponds to a column of such states in the diagram.
For each state of a given variable, there is a unique state of the previous variable that maximizes the probability (ties are broken either systematically or at random), corresponding to the function $\phi(x_n)$ given by (8.101), and this is indicated by the lines connecting the nodes. Once we know the most probable value of the final node $x_N$, we can then simply follow the link back to find the most probable state of node $x_{N-1}$ and so on back to the initial node $x_1$.
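Putting the pieces together, here is a compact sketch of the full procedure on a chain, storing the back-pointers $\phi$ of (8.101) during the forward pass and then tracing back from $x_N$ as just described. The function name and the random potentials are our own illustrative choices:

```python
import numpy as np

def max_sum_chain(log_f):
    """Most probable configuration of a chain via max-sum with
    back-tracking (8.99)-(8.101). log_f[n] holds ln f_{n,n+1}."""
    K = log_f[0].shape[0]
    msg = np.zeros(K)                      # mu_{x_1 -> f_{1,2}} = 0  (8.99)
    phi = []                               # back-pointers, one array per step
    for lf in log_f:
        scores = lf + msg[:, None]        # scores[i, j]: x_{n-1} = i, x_n = j
        phi.append(scores.argmax(axis=0)) # phi(x_n)                  (8.101)
        msg = scores.max(axis=0)
    x = [int(msg.argmax())]               # x_N^max                   (8.100)
    for back in reversed(phi):            # trace back along the trellis
        x.append(int(back[x[-1]]))
    return x[::-1]                        # states x_1, ..., x_N

rng = np.random.default_rng(2)
log_f = [np.log(rng.uniform(0.1, 1.0, (3, 3))) for _ in range(4)]
print(max_sum_chain(log_f))               # a list of 5 states in {0, 1, 2}
```

Because every column's argmax is resolved once during the forward pass, the traceback necessarily returns a single consistent maximizing configuration, which is exactly what the naive root-to-leaves message passing could not guarantee.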