Bishop C.M., Pattern Recognition and Machine Learning (2006)
Figure 8.31: An undirected graphical model representing a Markov random field for image de-noising, in which x_i is a binary variable denoting the state of pixel i in the unknown noise-free image, and y_i denotes the corresponding value of pixel i in the observed noisy image.

equivalently we can add the corresponding energies. In this example, this allows us to add an extra term h x_i for each pixel i in the noise-free image. Such a term has the effect of biasing the model towards pixel values that have one particular sign in preference to the other.

The complete energy function for the model then takes the form

    E(x, y) = h \sum_i x_i - \beta \sum_{\{i,j\}} x_i x_j - \eta \sum_i x_i y_i    (8.42)

where the sum over {i, j} runs over pairs of neighbouring pixels. This defines a joint distribution over x and y given by

    p(x, y) = \frac{1}{Z} \exp\{-E(x, y)\}    (8.43)

(Exercise 8.13). We now fix the elements of y to the observed values given by the pixels of the noisy image, which implicitly defines a conditional distribution p(x|y) over noise-free images.
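As a concrete illustration, the energy (8.42) can be written down directly in code. The following is a minimal sketch in Python; the function name, the representation of the images as NumPy arrays with entries in {−1, +1}, and the 4-neighbour grid are illustrative assumptions, not from the book:

```python
import numpy as np

def ising_energy(x, y, h, beta, eta):
    """Energy E(x, y) of (8.42) for images x, y with entries in {-1, +1}.

    x is a candidate noise-free image and y is the observed noisy image.
    Each horizontal and vertical neighbour pair {i, j} is counted once.
    """
    bias = h * x.sum()                                    # h * sum_i x_i
    pairwise = (x[:, :-1] * x[:, 1:]).sum() \
             + (x[:-1, :] * x[1:, :]).sum()               # sum_{i,j} x_i x_j
    data = (x * y).sum()                                  # sum_i x_i y_i
    return bias - beta * pairwise - eta * data
```

Lower energy corresponds to higher probability under (8.43), so de-noising amounts to minimizing this function over x with y held fixed.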
This is an example of the Ising model, which has been widely studied in statistical physics. For the purposes of image restoration, we wish to find an image x having a high probability (ideally the maximum probability). To do this we shall use a simple iterative technique called iterated conditional modes, or ICM (Kittler and Föglein, 1984), which is simply an application of coordinate-wise gradient ascent. The idea is first to initialize the variables {x_i}, which we do by simply setting x_i = y_i for all i. Then we take one node x_j at a time and we evaluate the total energy for the two possible states x_j = +1 and x_j = −1, keeping all other node variables fixed, and set x_j to whichever state has the lower energy. This will either leave the probability unchanged, if x_j is unchanged, or will increase it.
Because only one variable is changed, this is a simple local computation that can be performed efficiently. We then repeat the update for another site, and so on, until some suitable stopping criterion is satisfied. The nodes may be updated in a systematic way, for instance by repeatedly raster scanning through the image, or by choosing nodes at random.

If we have a sequence of updates in which every site is visited at least once, and in which no changes to the variables are made, then by definition the algorithm will have converged to a local maximum of the probability (Exercise 8.14). This need not, however, correspond to the global maximum.

Figure 8.32: (a) Example of a directed graph over the chain of nodes x_1, x_2, ..., x_{N−1}, x_N. (b) The equivalent undirected graph.

For the purposes of this simple illustration, we have fixed the parameters to be β = 1.0, η = 2.1 and h = 0.
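A compact sketch of the ICM sweep just described, under the same illustrative assumptions as before (NumPy arrays of ±1 values; the function name, the raster-scan order, and the sweep limit are our own choices):

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, max_sweeps=20):
    """Iterated conditional modes for the de-noising model (8.42).

    y is the observed noisy image with entries in {-1, +1}; the
    parameter defaults are the values used in the text.
    """
    x = y.copy()                          # initialize x_i = y_i for all i
    rows, cols = x.shape
    for _ in range(max_sweeps):
        changed = False
        for i in range(rows):             # raster scan through the image
            for j in range(cols):
                # Sum over the 4-neighbour values of pixel (i, j).
                nb = sum(x[a, b]
                         for a, b in ((i - 1, j), (i + 1, j),
                                      (i, j - 1), (i, j + 1))
                         if 0 <= a < rows and 0 <= b < cols)
                # Terms of E involving this pixel alone:
                #   E_local(s) = s * (h - beta * nb - eta * y_ij),
                # so choose the state s in {-1, +1} of lower energy.
                x_new = -1 if h - beta * nb - eta * y[i, j] > 0 else +1
                if x_new != x[i, j]:
                    x[i, j] = x_new
                    changed = True
        if not changed:                   # full sweep, no changes: converged
            break
    return x
```

Since the energy difference between x_j = +1 and x_j = −1 depends only on the four neighbours and on y_j, each update is a constant-time local computation, as noted above.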
Note that leaving h = 0 simply means that the prior probabilities of the two states of x_i are equal. Starting with the observed noisy image as the initial configuration, we run ICM until convergence, leading to the de-noised image shown in the lower left panel of Figure 8.30. Note that if we set β = 0, which effectively removes the links between neighbouring pixels, then the global most probable solution is given by x_i = y_i for all i, corresponding to the observed noisy image.

Later we shall discuss a more effective algorithm for finding high probability solutions called the max-product algorithm (Section 8.4), which typically leads to better solutions, although this is still not guaranteed to find the global maximum of the posterior distribution.
However, for certain classes of model, including the one given by (8.42),there exist efficient algorithms based on graph cuts that are guaranteed to find theglobal maximum (Greig et al., 1989; Boykov et al., 2001; Kolmogorov and Zabih,2004).
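As an aside for readers who want to experiment, one possible sketch of the classical min-cut reduction (in the spirit of Greig et al., 1989) is shown below, using the networkx min-cut routine; the encoding details, node names, and function name are our own illustrative choices rather than anything prescribed by the book:

```python
import numpy as np
import networkx as nx

def graphcut_denoise(y, h=0.0, beta=1.0, eta=2.1):
    """Exact MAP for the model (8.42) via a minimum s-t cut.

    Writing x_i = +1 as z_i = 1 and x_i = -1 as z_i = 0, the energy is,
    up to an additive constant,
        sum_i u_i z_i + 2*beta * #(disagreeing neighbour pairs)
    with u_i = 2*(h - eta*y_i), which a min cut minimizes exactly
    when beta >= 0.
    """
    rows, cols = y.shape
    G = nx.DiGraph()
    G.add_nodes_from(["s", "t"])
    for i in range(rows):
        for j in range(cols):
            u = 2.0 * (h - eta * y[i, j])
            if u > 0:
                G.add_edge("s", (i, j), capacity=u)    # pay u when z_ij = 1
            else:
                G.add_edge((i, j), "t", capacity=-u)   # pay -u when z_ij = 0
            for a, b in ((i + 1, j), (i, j + 1)):      # each pair added once
                if a < rows and b < cols:
                    G.add_edge((i, j), (a, b), capacity=2.0 * beta)
                    G.add_edge((a, b), (i, j), capacity=2.0 * beta)
    _, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    x = -np.ones_like(y)
    for node in sink_side:                # sink side corresponds to z = 1
        if node != "t":
            x[node] = 1
    return x
```

The pairwise terms are submodular whenever β ≥ 0, which is the condition under which this reduction yields the exact global maximum of the posterior.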
The lower right panel of Figure 8.30 shows the result of applying a graph-cut algorithm to the de-noising problem.

8.3.4 Relation to directed graphs

We have introduced two graphical frameworks for representing probability distributions, corresponding to directed and undirected graphs, and it is instructive to discuss the relation between these. Consider first the problem of taking a model that is specified using a directed graph and trying to convert it to an undirected graph. In some cases this is straightforward, as in the simple example in Figure 8.32.
Here the joint distribution for the directed graph is given as a product of conditionals in the form

    p(x) = p(x_1) p(x_2|x_1) p(x_3|x_2) \cdots p(x_N|x_{N-1}).    (8.44)

Now let us convert this to an undirected graph representation, as shown in Figure 8.32. In the undirected graph, the maximal cliques are simply the pairs of neighbouring nodes, and so from (8.39) we wish to write the joint distribution in the form

    p(x) = \frac{1}{Z} \psi_{1,2}(x_1, x_2) \psi_{2,3}(x_2, x_3) \cdots \psi_{N-1,N}(x_{N-1}, x_N).    (8.45)

This is easily done by identifying

    \psi_{1,2}(x_1, x_2) = p(x_1) p(x_2|x_1)
    \psi_{2,3}(x_2, x_3) = p(x_3|x_2)
        \vdots
    \psi_{N-1,N}(x_{N-1}, x_N) = p(x_N|x_{N-1})

where we have absorbed the marginal p(x_1) for the first node into the first potential function. Note that in this case, the partition function Z = 1.

Figure 8.33: Example of a simple directed graph (a) and the corresponding moral graph (b).

Let us consider how to generalize this construction, so that we can convert any distribution specified by a factorization over a directed graph into one specified by a factorization over an undirected graph.
This can be achieved if the clique potentialsof the undirected graph are given by the conditional distributions of the directedgraph. In order for this to be valid, we must ensure that the set of variables thatappears in each of the conditional distributions is a member of at least one clique ofthe undirected graph. For nodes on the directed graph having just one parent, this isachieved simply by replacing the directed link with an undirected link.
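Before turning to nodes with several parents, here is a tiny numerical check of the chain construction above; the three binary variables and their distribution tables are made up purely for illustration:

```python
import numpy as np

# Hypothetical directed factorization p(x) = p(x1) p(x2|x1) p(x3|x2)
# over three binary variables; p2[a, b] = p(x2 = b | x1 = a), etc.
p1 = np.array([0.6, 0.4])
p2 = np.array([[0.7, 0.3], [0.2, 0.8]])
p3 = np.array([[0.5, 0.5], [0.9, 0.1]])

# Clique potentials as identified above: absorb p(x1) into the first one.
psi12 = p1[:, None] * p2      # psi_{1,2}(x1, x2) = p(x1) p(x2|x1)
psi23 = p3                    # psi_{2,3}(x2, x3) = p(x3|x2)

# The undirected factorization reproduces the directed joint exactly,
# and it already sums to one, confirming that Z = 1.
joint_undirected = psi12[:, :, None] * psi23[None, :, :]
joint_directed = p1[:, None, None] * p2[:, :, None] * p3[None, :, :]
assert np.allclose(joint_undirected, joint_directed)
assert np.isclose(joint_undirected.sum(), 1.0)
```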
However, for nodes in the directed graph having more than one parent, this is not sufficient. These are nodes that have 'head-to-head' paths encountered in our discussion of conditional independence. Consider the simple directed graph over four nodes shown in Figure 8.33. The joint distribution for the directed graph takes the form

    p(x) = p(x_1) p(x_2) p(x_3) p(x_4|x_1, x_2, x_3).    (8.46)

We see that the factor p(x_4|x_1, x_2, x_3) involves the four variables x_1, x_2, x_3, and x_4, and so these must all belong to a single clique if this conditional distribution is to be absorbed into a clique potential. To ensure this, we add extra links between all pairs of parents of the node x_4. Anachronistically, this process of 'marrying the parents' has become known as moralization, and the resulting undirected graph, after dropping the arrows, is called the moral graph.
It is important to observe that the moral graph in this example is fully connected and so exhibits no conditional independence properties, in contrast to the original directed graph.

Thus in general, to convert a directed graph into an undirected graph, we first add additional undirected links between all pairs of parents for each node in the graph and then drop the arrows on the original links to give the moral graph. Then we initialize all of the clique potentials of the moral graph to 1. We then take each conditional distribution factor in the original directed graph and multiply it into one of the clique potentials.
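A minimal sketch of the graph-theoretic part of this procedure, representing the directed graph by parent lists (the function name and representation are hypothetical):

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph.

    parents maps each node of the directed graph to the list
    of its parents.
    """
    edges = set()
    for child, pa in parents.items():
        for p in pa:                         # drop arrows on the original links
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):     # 'marry' all pairs of parents
            edges.add(frozenset((p, q)))
    return edges

# The directed graph of Figure 8.33(a): x4 has parents x1, x2, x3.
print(moralize({"x1": [], "x2": [], "x3": [], "x4": ["x1", "x2", "x3"]}))
# Six edges in total: the moral graph of Figure 8.33(b) is fully connected.
```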
As a consequence of the moralization step, there will always exist at least one maximal clique that contains all of the variables in the factor. Note that in all cases the partition function is given by Z = 1.

The process of converting a directed graph into an undirected graph plays an important role in exact inference techniques such as the junction tree algorithm (Section 8.4). Converting from an undirected to a directed representation is much less common and in general presents problems due to the normalization constraints.

We saw that in going from a directed to an undirected representation we had to discard some conditional independence properties from the graph.
Of course, we could always trivially convert any distribution over a directed graph into one over an undirected graph by simply using a fully connected undirected graph. This would, however, discard all conditional independence properties and so would be vacuous. The process of moralization adds the fewest extra links and so retains the maximum number of independence properties.

We have seen that the procedure for determining the conditional independence properties is different between directed and undirected graphs. It turns out that the two types of graph can express different conditional independence properties, and it is worth exploring this issue in more detail. To do so, we return to the view of a specific (directed or undirected) graph as a filter (Section 8.2), so that the set of all possible distributions over the given variables could be reduced to a subset that respects the conditional independencies implied by the graph.