Bishop, C.M., Pattern Recognition and Machine Learning (2006) — Chapter 8: Graphical Models (excerpt)
Thus, for a graph with K nodes, the joint distribution is given by

$$
p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)
\tag{8.5}
$$

where pa_k denotes the set of parents of x_k, and x = {x1, . . . , xK}. This key equation expresses the factorization properties of the joint distribution for a directed graphical model. Although we have considered each node to correspond to a single variable, we can equally well associate sets of variables and vector-valued variables with the nodes of a graph. It is easy to show (Exercise 8.1) that the representation on the right-hand side of (8.5) is always correctly normalized provided the individual conditional distributions are normalized.

The directed graphs that we are considering are subject to an important restriction, namely that there must be no directed cycles; in other words, there are no closed paths within the graph such that we can move from node to node along links following the direction of the arrows and end up back at the starting node. Such graphs are also called directed acyclic graphs, or DAGs. This is equivalent to the statement (Exercise 8.2) that there exists an ordering of the nodes such that there are no links that go from any node to any lower-numbered node.
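As a concrete illustration of (8.5), the following minimal Python sketch evaluates the factorized joint for a small discrete DAG and checks the normalization property noted above. The three-node chain, its conditional tables, and all names here are illustrative assumptions, not an example from the text.

```python
# A minimal sketch of the factorization (8.5) for a hypothetical three-node
# chain  x1 -> x2 -> x3  with binary variables. The graph and the tables are
# illustrative assumptions.

# Conditional distributions p(x_k | pa_k), indexed by the parent values.
p_x1 = {0: 0.6, 1: 0.4}                                     # p(x1), no parents
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p(x2 | x1)
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p(x3 | x2)

def joint(x1: int, x2: int, x3: int) -> float:
    """Evaluate p(x) = p(x1) p(x2|x1) p(x3|x2), an instance of (8.5)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The factorized joint sums to one provided each conditional is normalized.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(0, 1, 0), total)   # total should be 1.0 (up to rounding)
```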
8.1.1 Example: Polynomial regression

As an illustration of the use of directed graphs to describe probability distributions, we consider the Bayesian polynomial regression model introduced in Section 1.2.6.
Figure 8.3  Directed graphical model representing the joint distribution (8.6) corresponding to the Bayesian polynomial regression model introduced in Section 1.2.6. The graph contains the node w together with the nodes t1, . . . , tN.

The random variables in this model are the vector of polynomial coefficients w and the observed data t = (t1, . . . , tN)^T. In addition, this model contains the input data x = (x1, . . . , xN)^T, the noise variance σ², and the hyperparameter α representing the precision of the Gaussian prior over w, all of which are parameters of the model rather than random variables. Focussing just on the random variables for the moment, we see that the joint distribution is given by the product of the prior p(w) and N conditional distributions p(tn | w) for n = 1, . . . , N, so that
$$
p(\mathbf{t}, \mathbf{w}) = p(\mathbf{w}) \prod_{n=1}^{N} p(t_n \mid \mathbf{w}).
\tag{8.6}
$$

This joint distribution can be represented by a graphical model shown in Figure 8.3.

When we start to deal with more complex models later in the book, we shall find it inconvenient to have to write out multiple nodes of the form t1, . . . , tN explicitly as in Figure 8.3. We therefore introduce a graphical notation that allows such multiple nodes to be expressed more compactly, in which we draw a single representative node tn and then surround this with a box, called a plate, labelled with N indicating that there are N nodes of this kind. Re-writing the graph of Figure 8.3 in this way, we obtain the graph shown in Figure 8.4.

Figure 8.4  An alternative, more compact, representation of the graph shown in Figure 8.3 in which we have introduced a plate (the box labelled N) that represents N nodes, of which only a single example tn is shown explicitly.
We shall sometimes find it helpful to make the parameters of a model, as well as its stochastic variables, explicit. In this case, (8.6) becomes

$$
p(\mathbf{t}, \mathbf{w} \mid \mathbf{x}, \alpha, \sigma^2) = p(\mathbf{w} \mid \alpha) \prod_{n=1}^{N} p(t_n \mid \mathbf{w}, x_n, \sigma^2).
$$

Correspondingly, we can make x and α explicit in the graphical representation. To do this, we shall adopt the convention that random variables will be denoted by open circles, and deterministic parameters will be denoted by smaller solid circles. If we take the graph of Figure 8.4 and include the deterministic parameters, we obtain the graph shown in Figure 8.5.

Figure 8.5  This shows the same model as in Figure 8.4 but with the deterministic parameters shown explicitly by the smaller solid nodes.
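For concreteness, the sketch below draws a single sample from this joint distribution by first sampling w from its Gaussian prior with precision α and then sampling each tn from a Gaussian centred on the value of the polynomial at xn with variance σ². The particular values of α, σ², the polynomial order, and the inputs x are illustrative assumptions, not values from the text.

```python
import numpy as np

# A minimal generative sketch of the factorization above: draw w from the
# Gaussian prior p(w | alpha) and then each t_n from p(t_n | w, x_n, sigma^2),
# using a polynomial of order M as in Section 1.2.6. The settings of alpha,
# sigma2, M, N and the inputs x are illustrative assumptions.
rng = np.random.default_rng(0)
alpha, sigma2, M, N = 2.0, 0.25, 3, 10

x = np.linspace(0.0, 1.0, N)                      # deterministic inputs x_n
Phi = np.vander(x, M + 1, increasing=True)        # polynomial features phi(x_n)

w = rng.normal(0.0, np.sqrt(1.0 / alpha), size=M + 1)   # w ~ N(0, alpha^{-1} I)
t = rng.normal(Phi @ w, np.sqrt(sigma2))                # t_n ~ N(w^T phi(x_n), sigma^2)
print(w, t)
```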
When we apply a graphical model to a problem in machine learning or pattern recognition, we will typically set some of the random variables to specific observed values, for example the variables {tn} from the training set in the case of polynomial curve fitting. In a graphical model, we will denote such observed variables by shading the corresponding nodes. Thus the graph corresponding to Figure 8.5 in which the variables {tn} are observed is shown in Figure 8.6.

Figure 8.6  As in Figure 8.5 but with the nodes {tn} shaded to indicate that the corresponding random variables have been set to their observed (training set) values.
Note that the value of w is not observed, and so w is an example of a latent variable, also known as a hidden variable. Such variables play a crucial role in many probabilistic models and will form the focus of Chapters 9 and 12.

Having observed the values {tn} we can, if desired, evaluate the posterior distribution of the polynomial coefficients w as discussed in Section 1.2.5. For the moment, we note that this involves a straightforward application of Bayes' theorem

$$
p(\mathbf{w} \mid T) \propto p(\mathbf{w}) \prod_{n=1}^{N} p(t_n \mid \mathbf{w})
\tag{8.7}
$$

where again we have omitted the deterministic parameters in order to keep the notation uncluttered.
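Because the prior over w is Gaussian with precision α and each likelihood term is Gaussian with variance σ² (Section 1.2.6), the right-hand side of (8.7) can be evaluated directly, up to normalization, for any candidate w. The sketch below does this in log space; the data, the hyperparameter settings, and the candidate coefficient vectors are illustrative assumptions.

```python
import numpy as np

# Evaluate the unnormalized posterior (8.7) for a candidate w:
# log p(w) + sum_n log p(t_n | w), with a Gaussian prior of precision alpha
# and Gaussian noise of variance sigma^2. Data and candidates are illustrative.
def log_gauss(y, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def unnormalized_log_posterior(w, x, t, alpha, sigma2):
    Phi = np.vander(x, len(w), increasing=True)         # polynomial features
    log_prior = np.sum(log_gauss(w, 0.0, 1.0 / alpha))  # log p(w)
    log_lik = np.sum(log_gauss(t, Phi @ w, sigma2))     # sum_n log p(t_n | w)
    return log_prior + log_lik

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)
w_a = np.zeros(4)                                   # a poor candidate
w_b = np.array([0.0, 10.67, -32.0, 21.33])          # roughly fits sin(2*pi*x) on [0, 1]
print(unnormalized_log_posterior(w_a, x, t, alpha=2.0, sigma2=0.1),
      unnormalized_log_posterior(w_b, x, t, alpha=2.0, sigma2=0.1))
```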
In general, model parameters such as w are of little direct interest in themselves, because our ultimate goal is to make predictions for new input values. Suppose we are given a new input value x̂ and we wish to find the corresponding probability distribution for t̂ conditioned on the observed data. The graphical model that describes this problem is shown in Figure 8.7, and the corresponding joint distribution of all of the random variables in this model, conditioned on the deterministic parameters, is then given by

$$
p(\widehat{t}, \mathbf{t}, \mathbf{w} \mid \widehat{x}, \mathbf{x}, \alpha, \sigma^2) = \left[ \prod_{n=1}^{N} p(t_n \mid x_n, \mathbf{w}, \sigma^2) \right] p(\mathbf{w} \mid \alpha)\, p(\widehat{t} \mid \widehat{x}, \mathbf{w}, \sigma^2).
\tag{8.8}
$$
Figure 8.7  The polynomial regression model, corresponding to Figure 8.6, showing also a new input value x̂ together with the corresponding model prediction t̂.

The required predictive distribution for t̂ is then obtained, from the sum rule of probability, by integrating out the model parameters w so that

$$
p(\widehat{t} \mid \widehat{x}, \mathbf{x}, \mathbf{t}, \alpha, \sigma^2) \propto \int p(\widehat{t}, \mathbf{t}, \mathbf{w} \mid \widehat{x}, \mathbf{x}, \alpha, \sigma^2)\, \mathrm{d}\mathbf{w}
$$

where we are implicitly setting the random variables in t to the specific values observed in the data set. The details of this calculation were discussed in Chapter 3.
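For this Gaussian model the integral is available in closed form, following the Bayesian linear-regression results of Chapter 3: the predictive distribution is Gaussian with mean m_N^T φ(x̂) and variance σ² + φ(x̂)^T S_N φ(x̂), where m_N and S_N are the posterior mean and covariance of w. The sketch below assumes those Chapter 3 results; the data and numerical settings are illustrative.

```python
import numpy as np

# Closed-form Gaussian predictive distribution for the polynomial model,
# assuming the Bayesian linear-regression results of Chapter 3 (beta = 1/sigma^2).
# Inputs, targets, and hyperparameter values are illustrative assumptions.
def predictive(x_hat, x, t, alpha, sigma2, M):
    beta = 1.0 / sigma2
    Phi = np.vander(x, M + 1, increasing=True)              # design matrix
    S_N = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)  # posterior covariance
    m_N = beta * S_N @ Phi.T @ t                            # posterior mean
    phi = np.vander(np.atleast_1d(x_hat), M + 1, increasing=True)[0]
    mean = phi @ m_N                                        # predictive mean
    var = sigma2 + phi @ S_N @ phi                          # predictive variance
    return mean, var

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)
print(predictive(0.35, x, t, alpha=2.0, sigma2=0.01, M=3))
```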
8.1.2 Generative models

There are many situations in which we wish to draw samples from a given probability distribution. Although we shall devote the whole of Chapter 11 to a detailed discussion of sampling methods, it is instructive to outline here one technique, called ancestral sampling, which is particularly relevant to graphical models. Consider a joint distribution p(x1, . . . , xK) over K variables that factorizes according to (8.5) corresponding to a directed acyclic graph. We shall suppose that the variables have been ordered such that there are no links from any node to any lower-numbered node; in other words, each node has a higher number than any of its parents.
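Such an ordering is a topological ordering of the DAG, and one can always be constructed, for example with Kahn's algorithm as in the sketch below; the example graph, stored as a node-to-parents dictionary, is an illustrative assumption.

```python
from collections import deque

# A minimal sketch of Kahn's algorithm: order the nodes of a DAG so that every
# node appears after all of its parents. The example graph (node -> parents)
# is an illustrative assumption.
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

def topological_order(parents):
    children = {node: [] for node in parents}
    remaining = {node: len(pars) for node, pars in parents.items()}  # unprocessed parents
    for node, pars in parents.items():
        for p in pars:
            children[p].append(node)
    ready = deque(node for node, count in remaining.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    if len(order) != len(parents):
        raise ValueError("graph contains a directed cycle")
    return order

print(topological_order(parents))   # e.g. ['a', 'b', 'c', 'd']
```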
Our goal is to draw a sample x̂1, . . . , x̂K from the joint distribution. To do this, we start with the lowest-numbered node and draw a sample from the distribution p(x1), which we call x̂1. We then work through each of the nodes in order, so that for node n we draw a sample from the conditional distribution p(xn | pan) in which the parent variables have been set to their sampled values. Note that at each stage, these parent values will always be available because they correspond to lower-numbered nodes that have already been sampled. Techniques for sampling from specific distributions will be discussed in detail in Chapter 11. Once we have sampled from the final variable xK, we will have achieved our objective of obtaining a sample from the joint distribution.
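A minimal sketch of this procedure for a small discrete DAG is given below; the graph, its conditional probability tables, and the helper for sampling from a discrete distribution are illustrative assumptions, not an example from the text.

```python
import random

# Ancestral sampling sketch for a small discrete DAG. Nodes are listed in an
# order compatible with the graph (each node after its parents), and each
# conditional p(x_n | pa_n) is a table keyed by the parents' sampled values.
# The graph and the tables are illustrative assumptions.
random.seed(0)

nodes = ["x1", "x2", "x3"]
parents = {"x1": [], "x2": ["x1"], "x3": ["x2"]}
cpts = {
    "x1": {(): {0: 0.6, 1: 0.4}},
    "x2": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "x3": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.5, 1: 0.5}},
}

def sample_discrete(dist):
    """Draw a value from a {value: probability} table."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs)[0]

def ancestral_sample():
    sample = {}
    for node in nodes:                                  # visit nodes parents-first
        key = tuple(sample[p] for p in parents[node])   # parents already sampled
        sample[node] = sample_discrete(cpts[node][key])
    return sample

print(ancestral_sample())   # e.g. {'x1': 1, 'x2': 1, 'x3': 0}
```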
To obtain a sample from some marginal distribution corresponding to a subset of the variables, we simply take the sampled values for the required nodes and ignore the sampled values for the remaining nodes. For example, to draw a sample from the distribution p(x2, x4), we simply sample from the full joint distribution and then retain the values x̂2, x̂4 and discard the remaining values {x̂_{j≠2,4}}.

Figure 8.8  A graphical model representing the process by which images of objects are created, in which the identity of an object (a discrete variable) and the position and orientation of that object (continuous variables) have independent prior probabilities. The image (a vector of pixel intensities) has a probability distribution that is dependent on the identity of the object as well as on its position and orientation.

For practical applications of probabilistic models, it will typically be the higher-numbered variables corresponding to terminal nodes of the graph that represent the observations, with lower-numbered nodes corresponding to latent variables.
The primary role of the latent variables is to allow a complicated distribution over the observed variables to be represented in terms of a model constructed from simpler (typically exponential family) conditional distributions.

We can interpret such models as expressing the processes by which the observed data arose. For instance, consider an object recognition task in which each observed data point corresponds to an image (comprising a vector of pixel intensities) of one of the objects. In this case, the latent variables might have an interpretation as the position and orientation of the object. Given a particular observed image, our goal is to find the posterior distribution over objects, in which we integrate over all possible positions and orientations.
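As a schematic illustration of this generative view, ancestral sampling of the graph in Figure 8.8 would draw the object identity, position, and orientation from their independent priors and then draw an image conditioned on all three. Every distribution, dimension, and the toy rendering function in the sketch below is an illustrative assumption rather than a model from the text.

```python
import numpy as np

# Schematic ancestral sampling of the generative model of Figure 8.8: sample
# object identity, position and orientation from independent priors, then an
# image conditioned on all three. All choices below are illustrative.
rng = np.random.default_rng(0)
objects = ["cup", "phone", "book"]

def render(obj, position, orientation, size=8):
    """Toy deterministic mean image: a smooth bump whose width, location and
    tilt stand in for the object identity, position and orientation."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = position
    width = 1.0 + objects.index(obj)                  # identity sets the bump width
    tilt = np.cos(orientation) * (xs - cx) + np.sin(orientation) * (ys - cy)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2 + tilt ** 2) / (2 * width ** 2))

obj = rng.choice(objects)                             # discrete prior over object identity
position = rng.uniform(0, 8, size=2)                  # continuous prior over position
orientation = rng.uniform(0, 2 * np.pi)               # continuous prior over orientation
image = render(obj, position, orientation) + rng.normal(0, 0.05, (8, 8))  # pixel noise
print(obj, position, orientation, image.shape)
```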