This corresponds to propagating a message back down the chain using

$$x_{n-1}^{\max} = \phi(x_n^{\max}) \qquad (8.102)$$

and is known as back-tracking. Note that there could be several values of $x_{n-1}$ all of which give the maximum value in (8.101). Provided we choose one of these values when we do the back-tracking, we are assured of a globally consistent maximizing configuration.

In Figure 8.53, we have indicated two paths, each of which we shall suppose corresponds to a global maximum of the joint probability distribution. If k = 2 and k = 3 each represent possible values of $x_N^{\max}$, then starting from either state and tracing back along the black lines, which corresponds to iterating (8.102), we obtain a valid global maximum configuration.
Note that if we had run a forward pass of max-sum message passing followed by a backward pass and then applied (8.98) at each node separately, we could end up selecting some states from one path and some from the other path, giving an overall configuration that is not a global maximizer. We see that it is necessary instead to keep track of the maximizing states during the forward pass using the functions $\phi(x_n)$ and then use back-tracking to find a consistent solution.

The extension to a general tree-structured factor graph should now be clear. If a message is sent from a factor node f to a variable node x, a maximization is performed over all other variable nodes $x_1, \ldots, x_M$ that are neighbours of that factor node, using (8.93). When we perform this maximization, we keep a record of which values of the variables $x_1, \ldots, x_M$ gave rise to the maximum. Then in the back-tracking step, having found $x^{\max}$, we can use these stored values to assign consistent maximizing states $x_1^{\max}, \ldots, x_M^{\max}$. The max-sum algorithm, with back-tracking, gives an exact maximizing configuration for the variables provided the factor graph is a tree.
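To make the forward pass and the back-tracking step concrete, the following Python sketch runs max-sum on a simple chain of N discrete variables, each taking K states. It is an illustration under assumed inputs (the function and array names `max_sum_chain`, `log_psi_unary`, `log_psi_pair` are not from the book); the array `phi` plays the role of the functions $\phi(x_n)$ in (8.102).

```python
# A minimal sketch of max-sum with back-tracking on a chain (assumed inputs).
import numpy as np

def max_sum_chain(log_psi_unary, log_psi_pair):
    """log_psi_unary: (N, K) log node potentials; log_psi_pair: (N-1, K, K)
    log edge potentials with rows indexing x_{n-1} and columns indexing x_n."""
    N, K = log_psi_unary.shape
    msg = log_psi_unary[0].copy()          # forward message, a function of x_1
    phi = np.zeros((N, K), dtype=int)      # phi[n, k]: best x_{n-1} given x_n = k

    # Forward pass: maximize over x_{n-1}, keeping a record of the maximizing state.
    for n in range(1, N):
        scores = msg[:, None] + log_psi_pair[n - 1]      # (K, K): x_{n-1} by x_n
        phi[n] = np.argmax(scores, axis=0)               # back-pointers
        msg = log_psi_unary[n] + np.max(scores, axis=0)

    # Back-tracking, iterating (8.102): x^max_{n-1} = phi(x^max_n).
    x_max = np.zeros(N, dtype=int)
    x_max[N - 1] = int(np.argmax(msg))
    for n in range(N - 1, 0, -1):
        x_max[n - 1] = phi[n, x_max[n]]
    return x_max, float(np.max(msg))       # maximizing states and max log value
```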
An important application of this technique is for finding the most probable sequence of hidden states in a hidden Markov model, in which case it is known as the Viterbi algorithm (see Section 13.2).

As with the sum-product algorithm, the inclusion of evidence in the form of observed variables is straightforward. The observed variables are clamped to their observed values, and the maximization is performed over the remaining hidden variables. This can be shown formally by including identity functions for the observed variables into the factor functions, as we did for the sum-product algorithm.

It is interesting to compare max-sum with the iterated conditional modes (ICM) algorithm described on page 389.
Each step in ICM is computationally simpler because the ‘messages’ that are passed from one node to the next comprise a single value consisting of the new state of the node for which the conditional distribution is maximized. The max-sum algorithm is more complex because the messages are functions of the node variables x and hence comprise a set of K values, one for each possible state of x. Unlike max-sum, however, ICM is not guaranteed to find a global maximum even for tree-structured graphs.
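The difference in the two kinds of ‘message’ can be seen in a small sketch: for a node with K states, an ICM step returns a single integer, whereas a max-sum message along an edge is a vector of K values. The fragment below is purely illustrative; the names and the index convention for the potential arrays (rows index the sending node, columns the receiving node) are assumptions, not notation from the book.

```python
# Illustrative contrast between ICM and max-sum 'messages' (assumed inputs).
import numpy as np

def icm_update(log_psi_node, log_psi_edges, neighbour_states):
    """ICM 'message': a single value, the new state of this node that maximizes
    its conditional given the current (fixed) states of its neighbours."""
    score = log_psi_node.copy()                      # (K,)
    for log_psi_edge, s in zip(log_psi_edges, neighbour_states):
        score += log_psi_edge[:, s]                  # neighbour clamped to state s
    return int(np.argmax(score))                     # one number

def max_sum_message(log_psi_node, log_psi_edge, incoming_messages):
    """Max-sum message along one edge: a function of the receiving variable,
    i.e. a vector of K values, one for each of its possible states."""
    total = log_psi_node + sum(incoming_messages)    # combine other incoming messages
    return np.max(log_psi_edge + total[:, None], axis=0)   # (K,) vector
```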
8.4.6 Exact inference in general graphs

The sum-product and max-sum algorithms provide efficient and exact solutions to inference problems in tree-structured graphs. For many practical applications, however, we have to deal with graphs having loops.

The message passing framework can be generalized to arbitrary graph topologies, giving an exact inference procedure known as the junction tree algorithm (Lauritzen and Spiegelhalter, 1988; Jordan, 2007). Here we give a brief outline of the key steps involved.
This is not intended to convey a detailed understanding of the algorithm, but rather to give a flavour of the various stages involved. If the starting point is a directed graph, it is first converted to an undirected graph by moralization, whereas if starting from an undirected graph this step is not required. Next the graph is triangulated, which involves finding chord-less cycles containing four or more nodes and adding extra links to eliminate such chord-less cycles.
For instance, in the graph in Figure 8.36, the cycle A–C–B–D–A is chord-less and so a link could be added between A and B or alternatively between C and D. Note that the joint distribution for the resulting triangulated graph is still defined by a product of the same potential functions, but these are now considered to be functions over expanded sets of variables. Next the triangulated graph is used to construct a new tree-structured undirected graph called a join tree, whose nodes correspond to the maximal cliques of the triangulated graph, and whose links connect pairs of cliques that have variables in common.
The selection of which pairs of cliques to connect in this way is important and is done so as to give a maximal spanning tree defined as follows. Of all possible trees that link up the cliques, the one that is chosen is one for which the weight of the tree is largest, where the weight for a link is the number of nodes shared by the two cliques it connects, and the weight for the tree is the sum of the weights for the links. If the tree is condensed, so that any clique that is a subset of another clique is absorbed into the larger clique, this gives a junction tree. As a consequence of the triangulation step, the resulting tree satisfies the running intersection property, which means that if a variable is contained in two cliques, then it must also be contained in every clique on the path that connects them.
This ensures that inference about variables will be consistent across the graph. Finally, a two-stage message passing algorithm, essentially equivalent to the sum-product algorithm, can now be applied to this junction tree in order to find marginals and conditionals. Although the junction tree algorithm sounds complicated, at its heart is the simple idea that we have used already of exploiting the factorization properties of the distribution to allow sums and products to be interchanged so that partial summations can be performed, thereby avoiding having to work directly with the joint distribution.
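The construction of the join tree from the maximal cliques can be sketched in a few lines. The code below is an illustration under assumed inputs (cliques represented as Python sets; the clique {C, D, E} in the example is hypothetical), not the book's algorithm: it builds a maximal spanning tree over the cliques using a greedy Kruskal-style pass, with link weights equal to the number of shared variables.

```python
# A minimal sketch of the maximal-spanning-tree step of junction tree construction.
from itertools import combinations

def join_tree_links(cliques):
    """cliques: list of sets of variable names. Returns the chosen tree links."""
    # Candidate links between cliques that share variables, heaviest first.
    candidates = sorted(
        ((len(a & b), i, j)
         for (i, a), (j, b) in combinations(enumerate(cliques), 2)
         if a & b),
        reverse=True)

    parent = list(range(len(cliques)))          # union-find, to avoid cycles
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    links = []
    for weight, i, j in candidates:             # greedy Kruskal-style selection
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            links.append((i, j, weight))
    return links

# Example cliques of a small triangulated graph (the third clique is made up):
cliques = [{'A', 'C', 'D'}, {'B', 'C', 'D'}, {'C', 'D', 'E'}]
print(join_tree_links(cliques))   # two links, each sharing {C, D}; tree weight 4
```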
The role of the junction tree is to provide a precise and efficient way to organize these computations. It is worth emphasizing that this is achieved using purely graphical operations!

The junction tree algorithm is exact for arbitrary graphs and is efficient in the sense that for a given graph there does not in general exist a computationally cheaper approach. Unfortunately, the algorithm must work with the joint distributions within each node (each of which corresponds to a clique of the triangulated graph), and so the computational cost of the algorithm is determined by the number of variables in the largest clique and will grow exponentially with this number in the case of discrete variables. An important concept is the treewidth of a graph (Bodlaender, 1993), which is defined in terms of the number of variables in the largest clique. In fact, it is defined to be one less than the size of the largest clique, to ensure that a tree has a treewidth of 1. Because in general there can be multiple different junction trees that can be constructed from a given starting graph, the treewidth is defined by the junction tree for which the largest clique has the fewest variables.
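As a simple illustration of this definition (a sketch with made-up example cliques, not taken from the book), the treewidth implied by one particular junction tree is just the size of its largest clique minus one; a chain of pairwise cliques therefore gives 1, while the triangulated graph of Figure 8.36, if the chord C–D is added, has cliques {A, C, D} and {B, C, D} and gives 2.

```python
# Treewidth implied by one junction tree: size of its largest clique minus one.
def treewidth_of_junction_tree(cliques):
    return max(len(c) for c in cliques) - 1

chain_cliques = [{'x1', 'x2'}, {'x2', 'x3'}, {'x3', 'x4'}]   # a simple chain
print(treewidth_of_junction_tree(chain_cliques))             # 1, as expected for a tree

triangulated_cliques = [{'A', 'C', 'D'}, {'B', 'C', 'D'}]    # Figure 8.36 with chord C-D
print(treewidth_of_junction_tree(triangulated_cliques))      # 2
```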
If the treewidth of the original graph is high, the junction tree algorithm becomes impractical.

8.4.7 Loopy belief propagation

For many problems of practical interest, it will not be feasible to use exact inference, and so we need to exploit effective approximation methods. An important class of such approximations, that can broadly be called variational methods, will be discussed in detail in Chapter 10. Complementing these deterministic approaches is a wide range of sampling methods, also called Monte Carlo methods, that are based on stochastic numerical sampling from distributions and that will be discussed at length in Chapter 11.

Here we consider one simple approach to approximate inference in graphs with loops, which builds directly on the previous discussion of exact inference in trees. The idea is simply to apply the sum-product algorithm even though there is no guarantee that it will yield good results.
This approach is known as loopy belief propagation (Frey and MacKay, 1998) and is possible because the message passing rules (8.66) and (8.69) for the sum-product algorithm are purely local. However, because the graph now has cycles, information can flow many times around the graph. For some models, the algorithm will converge, whereas for others it will not.

In order to apply this approach, we need to define a message passing schedule. Let us assume that one message is passed at a time on any given link and in any given direction. Each message sent from a node replaces any previous message sent in the same direction across the same link and will itself be a function only of the most recent messages received by that node at previous steps of the algorithm.

We have seen that a message can only be sent across a link from a node when all other messages have been received by that node across its other links.
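To give a flavour of what this looks like in practice, the short Python sketch below applies the sum-product message updates to a pairwise model consisting of a single loop of three binary variables. It is illustrative only: the potentials are invented, and for brevity it uses a synchronous schedule in which every message is recomputed from the most recent incoming messages on each sweep, rather than the one-message-at-a-time schedule described above.

```python
# Loopy belief propagation on a 3-variable binary cycle (illustrative sketch).
import numpy as np

edges = [(0, 1), (1, 2), (2, 0)]                                # a single loop
psi = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}    # favour agreement

# messages[(i, j)] is the message from variable i to variable j (length 2).
messages = {(i, j): np.ones(2) for i, j in edges}
messages.update({(j, i): np.ones(2) for i, j in edges})

for sweep in range(50):
    new = {}
    for (i, j) in messages:
        # Product of the most recent messages into i from all neighbours except j ...
        incoming = np.ones(2)
        for (k, l) in messages:
            if l == i and k != j:
                incoming *= messages[(k, l)]
        # ... then sum over the states of i, weighted by the edge potential.
        pair = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T   # rows index x_i
        msg = pair.T @ incoming                                  # sum-product update
        new[(i, j)] = msg / msg.sum()                            # normalize for stability
    messages = new

# Approximate (loopy) marginal for each variable: product of its incoming messages.
for i in range(3):
    belief = np.prod([messages[(k, l)] for (k, l) in messages if l == i], axis=0)
    print(i, belief / belief.sum())
```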