Bishop C.M. Pattern Recognition and Machine Learning (2006)
This follows from the fact that there is no direct path between the two nodes, and all other paths pass through nodes that are observed, and hence those paths are blocked. This conditional independence property can be expressed as

p(x_i, x_j | x_{∖{i,j}}) = p(x_i | x_{∖{i,j}}) p(x_j | x_{∖{i,j}})        (8.38)

where x_{∖{i,j}} denotes the set x of all variables with x_i and x_j removed. The factorization of the joint distribution must therefore be such that x_i and x_j do not appear in the same factor in order for the conditional independence property to hold for all possible distributions belonging to the graph.

This leads us to consider a graphical concept called a clique, which is defined as a subset of the nodes in a graph such that there exists a link between all pairs of nodes in the subset. In other words, the set of nodes in a clique is fully connected. Furthermore, a maximal clique is a clique such that it is not possible to include any other nodes from the graph in the set without it ceasing to be a clique.
These concepts are illustrated by the undirected graph over four variables shown in Figure 8.29. This graph has five cliques of two nodes given by {x_1, x_2}, {x_2, x_3}, {x_3, x_4}, {x_4, x_2}, and {x_1, x_3}, as well as two maximal cliques given by {x_1, x_2, x_3} and {x_2, x_3, x_4}. The set {x_1, x_2, x_3, x_4} is not a clique because of the missing link from x_1 to x_4.

We can therefore define the factors in the decomposition of the joint distribution to be functions of the variables in the cliques. In fact, we can consider functions of the maximal cliques, without loss of generality, because other cliques must be subsets of maximal cliques. Thus, if {x_1, x_2, x_3} is a maximal clique and we define an arbitrary function over this clique, then including another factor defined over a subset of these variables would be redundant.
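As a concrete check of these definitions, the following short sketch enumerates, by brute force, the cliques and maximal cliques of the graph in Figure 8.29; the edge list simply encodes the five links listed above.

```python
from itertools import combinations

# Edges of the four-node graph of Figure 8.29 (the five links listed in the text).
edges = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4), (4, 2), (1, 3)]}
nodes = {1, 2, 3, 4}

def is_clique(subset):
    """A subset of nodes is a clique if every pair of its nodes is linked."""
    return all(frozenset(pair) in edges for pair in combinations(subset, 2))

# All cliques containing at least two nodes.
cliques = [set(s) for r in range(2, len(nodes) + 1)
           for s in combinations(sorted(nodes), r) if is_clique(s)]

# A clique is maximal if no strict superset of it is also a clique.
maximal = [c for c in cliques if not any(c < d for d in cliques)]

print("cliques: ", cliques)   # the five pairs plus the two triangles
print("maximal: ", maximal)   # {1, 2, 3} and {2, 3, 4}
```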
Figure 8.29 A four-node undirected graph, over variables x_1, x_2, x_3, x_4, showing a clique (outlined in green) and a maximal clique (outlined in blue).

Let us denote a clique by C and the set of variables in that clique by x_C. Then the joint distribution is written as a product of potential functions ψ_C(x_C) over the maximal cliques of the graph

p(x) = (1/Z) ∏_C ψ_C(x_C)        (8.39)

Here the quantity Z, sometimes called the partition function, is a normalization constant and is given by

Z = ∑_x ∏_C ψ_C(x_C)        (8.40)

which ensures that the distribution p(x) given by (8.39) is correctly normalized. By considering only potential functions which satisfy ψ_C(x_C) ≥ 0 we ensure that p(x) ≥ 0.
In (8.40) we have assumed that x comprises discrete variables, but the framework is equally applicable to continuous variables, or a combination of the two, in which case the summation is replaced by the appropriate combination of summation and integration.

Note that we do not restrict the choice of potential functions to those that have a specific probabilistic interpretation as marginal or conditional distributions. This is in contrast to directed graphs in which each factor represents the conditional distribution of the corresponding variable, conditioned on the state of its parents.
However, in special cases, for instance where the undirected graph is constructed by starting with a directed graph, the potential functions may indeed have such an interpretation, as we shall see shortly.

One consequence of the generality of the potential functions ψ_C(x_C) is that their product will in general not be correctly normalized.
We therefore have to introduce an explicit normalization factor given by (8.40). Recall that for directed graphs, the joint distribution was automatically normalized as a consequence of the normalization of each of the conditional distributions in the factorization.

The presence of this normalization constant is one of the major limitations of undirected graphs. If we have a model with M discrete nodes each having K states, then the evaluation of the normalization term involves summing over K^M states and so (in the worst case) is exponential in the size of the model.
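To make this cost concrete, here is a minimal sketch that evaluates (8.40) by brute force for the graph of Figure 8.29 with binary variables (M = 4, K = 2), summing the product of the two maximal-clique potentials over all K^M = 16 joint states; the particular potential tables are arbitrary nonnegative values chosen only for illustration.

```python
import itertools
import numpy as np

# Binary states for each of the M = 4 variables (K = 2), as in Figure 8.29.
states = [0, 1]
rng = np.random.default_rng(0)

# One nonnegative potential per maximal clique {x1, x2, x3} and {x2, x3, x4};
# the entries are arbitrary positive numbers used purely as an example.
psi_123 = rng.uniform(0.5, 2.0, size=(2, 2, 2))
psi_234 = rng.uniform(0.5, 2.0, size=(2, 2, 2))

def unnormalized(x1, x2, x3, x4):
    """Product of maximal-clique potentials, the numerator of (8.39)."""
    return psi_123[x1, x2, x3] * psi_234[x2, x3, x4]

# Partition function (8.40): sum over all K**M joint configurations.
Z = sum(unnormalized(*x) for x in itertools.product(states, repeat=4))

# Normalized joint distribution (8.39) for one particular configuration.
x = (0, 1, 1, 0)
p_x = unnormalized(*x) / Z
print(f"Z = {Z:.4f}, p{x} = {p_x:.4f}")
```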
The partition function is needed for parameter learning because it will be a function of any parameters that govern the potential functions ψ_C(x_C). However, for evaluation of local conditional distributions, the partition function is not needed, because a conditional is the ratio of two marginals, and the partition function cancels between numerator and denominator when evaluating this ratio. Similarly, for evaluating local marginal probabilities we can work with the unnormalized joint distribution and then normalize the marginals explicitly at the end.
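In the same spirit, the sketch below (again with arbitrary potentials chosen only for illustration) evaluates a local conditional distribution as a ratio in which Z cancels, and a local marginal by summing the unnormalized joint and normalizing explicitly at the end.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary illustrative potentials over the two maximal cliques of Figure 8.29.
psi_123 = rng.uniform(0.5, 2.0, size=(2, 2, 2))
psi_234 = rng.uniform(0.5, 2.0, size=(2, 2, 2))

def p_tilde(x1, x2, x3, x4):
    """Unnormalized joint: the product of clique potentials, without 1/Z."""
    return psi_123[x1, x2, x3] * psi_234[x2, x3, x4]

# Conditional p(x1 = 1 | x2, x3, x4): Z cancels between numerator and denominator.
x2, x3, x4 = 1, 0, 1
cond = p_tilde(1, x2, x3, x4) / sum(p_tilde(v, x2, x3, x4) for v in (0, 1))

# Marginal p(x1): sum the unnormalized joint over the remaining variables,
# then normalize the resulting two numbers explicitly at the end.
m = [sum(p_tilde(v, *rest) for rest in itertools.product((0, 1), repeat=3))
     for v in (0, 1)]
marg = np.array(m) / np.sum(m)

print(f"p(x1=1 | x2={x2}, x3={x3}, x4={x4}) = {cond:.4f}")
print(f"p(x1) = {marg}")
```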
Provided the marginals only involve a small number of variables, the evaluation of their normalization coefficient will be feasible.

So far, we have discussed the notion of conditional independence based on simple graph separation, and we have proposed a factorization of the joint distribution that is intended to correspond to this conditional independence structure. However, we have not made any formal connection between conditional independence and factorization for undirected graphs.
To do so we need to restrict attention to potential functions ψ_C(x_C) that are strictly positive (i.e., never zero or negative for any choice of x_C). Given this restriction, we can make a precise relationship between factorization and conditional independence.

To do this we again return to the concept of a graphical model as a filter, corresponding to Figure 8.25. Consider the set of all possible distributions defined over a fixed set of variables corresponding to the nodes of a particular undirected graph. We can define UI to be the set of such distributions that are consistent with the set of conditional independence statements that can be read from the graph using graph separation.
Similarly, we can define UF to be the set of such distributions that can be expressed as a factorization of the form (8.39) with respect to the maximal cliques of the graph. The Hammersley-Clifford theorem (Clifford, 1990) states that the sets UI and UF are identical.

Because we are restricted to potential functions which are strictly positive, it is convenient to express them as exponentials, so that

ψ_C(x_C) = exp{−E(x_C)}        (8.41)

where E(x_C) is called an energy function, and the exponential representation is called the Boltzmann distribution. The joint distribution is defined as the product of potentials, and so the total energy is obtained by adding the energies of each of the maximal cliques.

In contrast to the factors in the joint distribution for a directed graph, the potentials in an undirected graph do not have a specific probabilistic interpretation. Although this gives greater flexibility in choosing the potential functions, because there is no normalization constraint, it does raise the question of how to motivate a choice of potential function for a particular application.
This can be done by viewing the potential function as expressing which configurations of the local variables are preferred to others. Global configurations that have a relatively high probability are those that find a good balance in satisfying the (possibly conflicting) influences of the clique potentials. We turn now to a specific example to illustrate the use of undirected graphs.

8.3.3 Illustration: Image de-noising

We can illustrate the application of undirected graphs using an example of noise removal from a binary image (Besag, 1974; Geman and Geman, 1984; Besag, 1986). Although a very simple example, this is typical of more sophisticated applications. Let the observed noisy image be described by an array of binary pixel values y_i ∈ {−1, +1}, where the index i = 1, . . . , D runs over all pixels.
We shall suppose that the image is obtained by taking an unknown noise-free image, described by binary pixel values x_i ∈ {−1, +1}, and randomly flipping the sign of pixels with some small probability. An example binary image, together with a noise-corrupted image obtained by flipping the sign of the pixels with probability 10%, is shown in Figure 8.30. Given the noisy image, our goal is to recover the original noise-free image.

Because the noise level is small, we know that there will be a strong correlation between x_i and y_i.
Figure 8.30 Illustration of image de-noising using a Markov random field. The top row shows the original binary image on the left and the corrupted image after randomly changing 10% of the pixels on the right. The bottom row shows the restored images obtained using iterated conditional modes (ICM) on the left and using the graph-cut algorithm on the right. ICM produces an image in which 96% of the pixels agree with the original image, whereas the corresponding number for graph-cut is 99%.

We also know that neighbouring pixels x_i and x_j in an image are strongly correlated. This prior knowledge can be captured using the Markov random field model whose undirected graph is shown in Figure 8.31.
This graph has two types of cliques, each of which contains two variables. The cliques of the form {x_i, y_i} have an associated energy function that expresses the correlation between these variables. We choose a very simple energy function for these cliques of the form −η x_i y_i where η is a positive constant. This has the desired effect of giving a lower energy (thus encouraging a higher probability) when x_i and y_i have the same sign and a higher energy when they have the opposite sign.

The remaining cliques comprise pairs of variables {x_i, x_j} where i and j are indices of neighbouring pixels. Again, we want the energy to be lower when the pixels have the same sign than when they have the opposite sign, and so we choose an energy given by −β x_i x_j where β is a positive constant.

Because a potential function is an arbitrary, nonnegative function over a maximal clique, we can multiply it by any nonnegative functions of subsets of the clique, or equivalently we can add the corresponding energies.
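A simple way to use these clique energies for de-noising is the iterated conditional modes (ICM) algorithm referred to in the caption of Figure 8.30: starting from the noisy image, each pixel in turn is set to whichever of its two states gives the lower energy, holding the remaining pixels fixed. The sketch below illustrates this on a small synthetic image; the image size, the particular values of η and β, and the use of a 4-neighbourhood are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, beta = 2.1, 1.0          # assumed positive coupling constants
H, W = 32, 32                 # assumed image size

# Hypothetical noise-free binary image: a square of +1 pixels on a -1 background.
x_true = -np.ones((H, W), dtype=int)
x_true[8:24, 8:24] = 1

# Observed image y: flip each pixel's sign independently with probability 10%.
flip = rng.random((H, W)) < 0.10
y = np.where(flip, -x_true, x_true)

def local_energy(x, i, j, value):
    """Sum of clique energies -eta*x_i*y_i and -beta*x_i*x_j involving pixel (i, j)."""
    e = -eta * value * y[i, j]
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):     # 4-neighbourhood
        ni, nj = i + di, j + dj
        if 0 <= ni < H and 0 <= nj < W:
            e += -beta * value * x[ni, nj]
    return e

# Iterated conditional modes: start from the noisy image and repeatedly set each
# pixel to whichever of {-1, +1} gives the lower local energy.
x = y.copy()
for _ in range(10):
    for i in range(H):
        for j in range(W):
            x[i, j] = min((-1, 1), key=lambda v: local_energy(x, i, j, v))

print("pixels agreeing with the original:", f"{np.mean(x == x_true):.1%}")
```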