Show that the predictive variance is given by (7.91).

7.15 www Using the results (7.94) and (7.95), show that the marginal likelihood (7.85) can be written in the form (7.96), where λ(αn) is defined by (7.97) and the sparsity and quality factors are defined by (7.98) and (7.99), respectively.

7.16 By taking the second derivative of the log marginal likelihood (7.97) for the regression RVM with respect to the hyperparameter αi, show that the stationary point given by (7.101) is a maximum of the marginal likelihood.

7.17 Using (7.83) and (7.86), together with the matrix identity (C.7), show that the quantities Sn and Qn defined by (7.102) and (7.103) can be written in the form (7.106) and (7.107).

7.18 www Show that the gradient vector and Hessian matrix of the log posterior distribution (7.109) for the classification relevance vector machine are given by (7.110) and (7.111).

7.19 Verify that maximization of the approximate log marginal likelihood function (7.114) for the classification relevance vector machine leads to the result (7.116) for re-estimation of the hyperparameters.

8. Graphical Models

Probabilities play a central role in modern pattern recognition.
We have seen in Chapter 1 that probability theory can be expressed in terms of two simple equations corresponding to the sum rule and the product rule. All of the probabilistic inference and learning manipulations discussed in this book, no matter how complex, amount to repeated application of these two equations. We could therefore proceed to formulate and solve complicated probabilistic models purely by algebraic manipulation. However, we shall find it highly advantageous to augment the analysis using diagrammatic representations of probability distributions, called probabilistic graphical models.
These offer several useful properties:

1. They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.

2. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.

3. Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.

A graph comprises nodes (also called vertices) connected by links (also known as edges or arcs). In a probabilistic graphical model, each node represents a random variable (or group of random variables), and the links express probabilistic relationships between these variables. The graph then captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors each depending only on a subset of the variables.
We shall begin by discussing Bayesian networks, also known as directed graphical models, in which the links of the graphs have a particular directionality indicated by arrows. The other major class of graphical models are Markov random fields, also known as undirected graphical models, in which the links do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables. For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a factor graph.

In this chapter, we shall focus on the key aspects of graphical models as needed for applications in pattern recognition and machine learning.
More general treatments of graphical models can be found in the books by Whittaker (1990), Lauritzen (1996), Jensen (1996), Castillo et al. (1997), Jordan (1999), Cowell et al. (1999), and Jordan (2007).

8.1. Bayesian Networks

In order to motivate the use of directed graphs to describe probability distributions, consider first an arbitrary joint distribution p(a, b, c) over three variables a, b, and c. Note that at this stage, we do not need to specify anything further about these variables, such as whether they are discrete or continuous. Indeed, one of the powerful aspects of graphical models is that a specific graph can make probabilistic statements for a broad class of distributions.
By application of the product rule of probability (1.11), we can write the joint distribution in the form

p(a, b, c) = p(c|a, b)p(a, b).        (8.1)

A second application of the product rule, this time to the second term on the right-hand side of (8.1), gives

p(a, b, c) = p(c|a, b)p(b|a)p(a).        (8.2)

Note that this decomposition holds for any choice of the joint distribution. We now represent the right-hand side of (8.2) in terms of a simple graphical model as follows. First we introduce a node for each of the random variables a, b, and c and associate each node with the corresponding conditional distribution on the right-hand side of (8.2).

Figure 8.1  A directed graphical model representing the joint probability distribution over three variables a, b, and c, corresponding to the decomposition on the right-hand side of (8.2).
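Since (8.2) follows from nothing more than the product rule, it holds for any joint distribution, and this is easy to check numerically. The following minimal Python sketch (the array names and the choice of binary variables are purely illustrative, not from the text) builds an arbitrary joint table over three binary variables, extracts the factors p(a), p(b|a), and p(c|a, b) by marginalization, and confirms that their product reconstructs the joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution p(a, b, c) over three binary variables,
# stored as a 2x2x2 table normalized to sum to one.
p_abc = rng.random((2, 2, 2))
p_abc /= p_abc.sum()

# Factors obtained by marginalization, following (8.2).
p_ab = p_abc.sum(axis=2)                   # p(a, b)
p_a = p_ab.sum(axis=1)                     # p(a)
p_b_given_a = p_ab / p_a[:, None]          # p(b | a)
p_c_given_ab = p_abc / p_ab[:, :, None]    # p(c | a, b)

# Reassemble the joint as p(c | a, b) p(b | a) p(a) and compare with the original.
reconstruction = p_c_given_ab * p_b_given_a[:, :, None] * p_a[:, None, None]
assert np.allclose(reconstruction, p_abc)  # holds for any choice of p(a, b, c)
```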
Then, for each conditional distribution we add directed links (arrows) to the graph from the nodes corresponding to the variables on which the distribution is conditioned. Thus for the factor p(c|a, b), there will be links from nodes a and b to node c, whereas for the factor p(a) there will be no incoming links. The result is the graph shown in Figure 8.1. If there is a link going from a node a to a node b, then we say that node a is the parent of node b, and we say that node b is the child of node a. Note that we shall not make any formal distinction between a node and the variable to which it corresponds but will simply use the same symbol to refer to both.

An interesting point to note about (8.2) is that the left-hand side is symmetrical with respect to the three variables a, b, and c, whereas the right-hand side is not. Indeed, in making the decomposition in (8.2), we have implicitly chosen a particular ordering, namely a, b, c, and had we chosen a different ordering we would have obtained a different decomposition and hence a different graphical representation. We shall return to this point later.

For the moment let us extend the example of Figure 8.1 by considering the joint distribution over K variables given by p(x1, . . . , xK). By repeated application of the product rule of probability, this joint distribution can be written as a product of conditional distributions, one for each of the variables

p(x1, . . . , xK) = p(xK | x1, . . . , xK−1) . . . p(x2|x1)p(x1).        (8.3)

For a given choice of K, we can again represent this as a directed graph having K nodes, one for each conditional distribution on the right-hand side of (8.3), with each node having incoming links from all lower numbered nodes.
We say that this graph is fully connected because there is a link between every pair of nodes.

So far, we have worked with completely general joint distributions, so that the decompositions, and their representations as fully connected graphs, will be applicable to any choice of distribution. As we shall see shortly, it is the absence of links in the graph that conveys interesting information about the properties of the class of distributions that the graph represents.
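To make (8.3) and the remark about fully connected graphs concrete, the short sketch below (a hypothetical helper, not part of the text) lists the conditional factors produced by the chain rule for one particular ordering of K variables and counts the links of the resulting graph, which joins every pair of nodes and therefore has K(K−1)/2 links.

```python
def chain_rule_factors(variables):
    """Factors p(x_k | x_1, ..., x_{k-1}) of (8.3) for a given ordering."""
    return [(var, tuple(variables[:k])) for k, var in enumerate(variables)]

variables = ["x1", "x2", "x3", "x4"]  # K = 4; the ordering is an arbitrary choice
for var, parents in chain_rule_factors(variables):
    print(f"p({var} | {', '.join(parents)})" if parents else f"p({var})")

# Each parent contributes one incoming link, so the fully connected graph
# has K(K-1)/2 links in total.
num_links = sum(len(parents) for _, parents in chain_rule_factors(variables))
assert num_links == len(variables) * (len(variables) - 1) // 2
```

Choosing a different ordering of the variables would produce a different, but equally valid, set of factors, mirroring the point made above about the decomposition (8.2).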
Consider the graph shown in Figure 8.2. This is not a fully connected graph because, for instance, there is no link from x1 to x2 or from x3 to x7.

We shall now go from this graph to the corresponding representation of the joint probability distribution written in terms of the product of a set of conditional distributions, one for each node in the graph. Each such conditional distribution will be conditioned only on the parents of the corresponding node in the graph. For instance, x5 will be conditioned on x1 and x3.

Figure 8.2  Example of a directed acyclic graph describing the joint distribution over variables x1, . . . , x7. The corresponding decomposition of the joint distribution is given by (8.4).

The joint distribution of all 7 variables is therefore given by

p(x1)p(x2)p(x3)p(x4|x1, x2, x3)p(x5|x1, x3)p(x6|x4)p(x7|x4, x5).        (8.4)

The reader should take a moment to study carefully the correspondence between (8.4) and Figure 8.2.

We can now state in general terms the relationship between a given directed graph and the corresponding distribution over the variables. The joint distribution defined by a graph is given by the product, over all of the nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of that node in the graph.
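The general rule just stated translates directly into code. In the sketch below (the dictionary-of-parents representation and the random conditional tables are illustrative assumptions, not from the text), the graph of Figure 8.2 is stored as a mapping from each node to its parents, and the joint distribution (8.4) is evaluated as the product, over the nodes, of each conditional probability given its parent values; summing over all assignments confirms that the result is a normalized distribution.

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(1)

# Parents of each node in the graph of Figure 8.2.
parents = {
    "x1": [], "x2": [], "x3": [],
    "x4": ["x1", "x2", "x3"],
    "x5": ["x1", "x3"],
    "x6": ["x4"],
    "x7": ["x4", "x5"],
}

def random_cpt(num_parents):
    """Illustrative conditional table for a binary node: one distribution per parent configuration."""
    table = rng.random((2,) * num_parents + (2,))
    return table / table.sum(axis=-1, keepdims=True)

cpt = {node: random_cpt(len(pa)) for node, pa in parents.items()}

def joint_probability(assignment):
    """Evaluate (8.4): the product over nodes of p(x_k | pa_k)."""
    prob = 1.0
    for node, pa in parents.items():
        index = tuple(assignment[p] for p in pa) + (assignment[node],)
        prob *= cpt[node][index]
    return prob

# Summing over all 2**7 joint assignments gives one, so (8.4) defines a valid distribution.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
assert np.isclose(total, 1.0)
```

A similar function could be written for the fully connected factorization (8.3); the saving offered by the graph of Figure 8.2 is that each conditional table depends only on the node's parents rather than on all of the preceding variables.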