Once again, our strategy for evaluating this distribution efficiently will be to focus on the quadratic form in the exponent of the joint distribution and thereby to identify the mean and covariance of the marginal distribution p(x_a). The quadratic form for the joint distribution can be expressed, using the partitioned precision matrix, in the form (2.70).
Because our goal is to integrate out x_b, this is most easily achieved by first considering the terms involving x_b and then completing the square in order to facilitate integration. Picking out just those terms that involve x_b, we have

-\frac{1}{2} x_b^T \Lambda_{bb} x_b + x_b^T m = -\frac{1}{2} (x_b - \Lambda_{bb}^{-1} m)^T \Lambda_{bb} (x_b - \Lambda_{bb}^{-1} m) + \frac{1}{2} m^T \Lambda_{bb}^{-1} m    (2.84)

where we have defined

m = \Lambda_{bb} \mu_b - \Lambda_{ba} (x_a - \mu_a).    (2.85)

We see that the dependence on x_b has been cast into the standard quadratic form of a Gaussian distribution corresponding to the first term on the right-hand side of (2.84), plus a term that does not depend on x_b (but that does depend on x_a).
Thus, when we take the exponential of this quadratic form, we see that the integration over x_b required by (2.83) will take the form

\int \exp\left\{ -\frac{1}{2} (x_b - \Lambda_{bb}^{-1} m)^T \Lambda_{bb} (x_b - \Lambda_{bb}^{-1} m) \right\} \mathrm{d}x_b.    (2.86)

This integration is easily performed by noting that it is the integral over an unnormalized Gaussian, and so the result will be the reciprocal of the normalization coefficient. We know from the form of the normalized Gaussian given by (2.43) that this coefficient is independent of the mean and depends only on the determinant of the covariance matrix. Thus, by completing the square with respect to x_b, we can integrate out x_b, and the only term remaining from the contributions on the left-hand side of (2.84) that depends on x_a is the last term on the right-hand side of (2.84), in which m is given by (2.85).
Combining this term with the remaining terms from (2.70) that depend on x_a, we obtain

\frac{1}{2} [\Lambda_{bb} \mu_b - \Lambda_{ba}(x_a - \mu_a)]^T \Lambda_{bb}^{-1} [\Lambda_{bb} \mu_b - \Lambda_{ba}(x_a - \mu_a)] - \frac{1}{2} x_a^T \Lambda_{aa} x_a + x_a^T (\Lambda_{aa} \mu_a + \Lambda_{ab} \mu_b) + \text{const}
    = -\frac{1}{2} x_a^T (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba}) x_a + x_a^T (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba}) \mu_a + \text{const}    (2.87)

where 'const' denotes quantities independent of x_a. Again, by comparison with (2.71), we see that the covariance of the marginal distribution p(x_a) is given by

\Sigma_a = (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba})^{-1}.    (2.88)

Similarly, the mean is given by

\Sigma_a (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba}) \mu_a = \mu_a    (2.89)

where we have used (2.88).
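As a quick numerical sanity check of (2.88) and (2.89), the following NumPy sketch (not from the book; the dimensions, seed, and sample size are arbitrary choices of mine) draws samples from a joint Gaussian, discards the x_b components, and compares the sample mean and covariance of x_a with \mu_a and with the inverse of the Schur complement in (2.88).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assemble an arbitrary valid joint Gaussian: mean mu, positive-definite Sigma.
d_a, d_b = 2, 3
D = d_a + d_b
mu = rng.normal(size=D)
M = rng.normal(size=(D, D))
Sigma = M @ M.T + np.eye(D)               # positive definite by construction
Lam = np.linalg.inv(Sigma)                # precision matrix

# Partition the precision matrix as in (2.69).
Laa, Lab = Lam[:d_a, :d_a], Lam[:d_a, d_a:]
Lba, Lbb = Lam[d_a:, :d_a], Lam[d_a:, d_a:]

# Covariance of the marginal p(x_a) according to (2.88).
Sigma_a = np.linalg.inv(Laa - Lab @ np.linalg.inv(Lbb) @ Lba)

# Monte Carlo marginalization: draw joint samples, keep only the x_a block.
x = rng.multivariate_normal(mu, Sigma, size=500_000)
xa = x[:, :d_a]
print(np.abs(xa.mean(axis=0) - mu[:d_a]).max())   # ~0: mean is mu_a   (2.89)
print(np.abs(np.cov(xa.T) - Sigma_a).max())       # ~0: covariance is  (2.88)
```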
The covariance in (2.88) is expressed in terms of the partitioned precision matrix given by (2.69). We can rewrite this in terms of the corresponding partitioning of the covariance matrix given by (2.67), as we did for the conditional distribution. These partitioned matrices are related by

\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}.    (2.90)

Making use of (2.76), we then have

(\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba})^{-1} = \Sigma_{aa}.    (2.91)

Thus we obtain the intuitively satisfying result that the marginal distribution p(x_a) has mean and covariance given by

E[x_a] = \mu_a    (2.92)
cov[x_a] = \Sigma_{aa}.    (2.93)

We see that for a marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix, in contrast to the conditional distribution, for which the partitioned precision matrix gives rise to simpler expressions.

Our results for the marginal and conditional distributions of a partitioned Gaussian are summarized below.

Partitioned Gaussians

Given a joint Gaussian distribution N(x|\mu, \Sigma) with \Lambda \equiv \Sigma^{-1} and

x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}    (2.94)
\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}.    (2.95)

Conditional distribution:

p(x_a|x_b) = N(x_a|\mu_{a|b}, \Lambda_{aa}^{-1})    (2.96)
\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b).    (2.97)

Marginal distribution:

p(x_a) = N(x_a|\mu_a, \Sigma_{aa}).    (2.98)

[Figure 2.9: The plot on the left shows the contours of a Gaussian distribution p(x_a, x_b) over two variables, and the plot on the right shows the marginal distribution p(x_a) (blue curve) and the conditional distribution p(x_a|x_b) for x_b = 0.7 (red curve).]

We illustrate the idea of conditional and marginal distributions associated with a multivariate Gaussian using an example involving two variables in Figure 2.9.
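The quantities behind Figure 2.9 can be reproduced directly from the boxed results. The short sketch below (my own illustration; the joint mean and covariance are made-up values chosen to resemble the figure) evaluates the marginal via (2.98) and the conditional at x_b = 0.7 via (2.96) and (2.97).

```python
import numpy as np

mu = np.array([0.5, 0.5])                 # (mu_a, mu_b), illustrative values
Sigma = np.array([[0.10, 0.06],
                  [0.06, 0.10]])          # illustrative joint covariance
Lam = np.linalg.inv(Sigma)                # precision matrix

# Marginal p(x_a) = N(x_a | mu_a, Sigma_aa)                        (2.98)
print("marginal mean, var:", mu[0], Sigma[0, 0])

# Conditional p(x_a | x_b) = N(x_a | mu_{a|b}, Lambda_aa^{-1})     (2.96)
xb = 0.7
mu_ab = mu[0] - Lam[0, 1] * (xb - mu[1]) / Lam[0, 0]               # (2.97)
print("conditional mean, var:", mu_ab, 1.0 / Lam[0, 0])
```

Note that the marginal uses the covariance partition while the conditional uses the precision partition, mirroring the remark above about which parameterization is simpler in each case.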
2.3.3 Bayes' theorem for Gaussian variables

In Sections 2.3.1 and 2.3.2, we considered a Gaussian p(x) in which we partitioned the vector x into two subvectors x = (x_a, x_b) and then found expressions for the conditional distribution p(x_a|x_b) and the marginal distribution p(x_a). We noted that the mean of the conditional distribution p(x_a|x_b) was a linear function of x_b. Here we shall suppose that we are given a Gaussian marginal distribution p(x) and a Gaussian conditional distribution p(y|x) in which p(y|x) has a mean that is a linear function of x, and a covariance which is independent of x. This is an example of a linear Gaussian model (Roweis and Ghahramani, 1999), which we shall study in greater generality in Section 8.1.4. We wish to find the marginal distribution p(y) and the conditional distribution p(x|y). This is a problem that will arise frequently in subsequent chapters, and it will prove convenient to derive the general results here.

We shall take the marginal and conditional distributions to be

p(x) = N(x|\mu, \Lambda^{-1})    (2.99)
p(y|x) = N(y|Ax + b, L^{-1})    (2.100)

where \mu, A, and b are parameters governing the means, and \Lambda and L are precision matrices. If x has dimensionality M and y has dimensionality D, then the matrix A has size D \times M.
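To make the model concrete, here is a minimal sketch (the dimensions M = 3, D = 2 and all parameter values are my own illustrative choices, not the book's) that instantiates (2.99) and (2.100) and draws one sample by ancestral sampling: first x from p(x), then y from p(y|x).

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 3, 2                            # dimensionality of x and of y

mu = rng.normal(size=M)                # mean of p(x)
Lam = 2.0 * np.eye(M)                  # precision of p(x)
A = rng.normal(size=(D, M))            # D x M, as stated in the text
b = rng.normal(size=D)
L = 4.0 * np.eye(D)                    # precision of p(y|x)

# Ancestral sampling: x ~ N(mu, Lam^{-1}), then y ~ N(Ax + b, L^{-1}).
x = rng.multivariate_normal(mu, np.linalg.inv(Lam))
y = rng.multivariate_normal(A @ x + b, np.linalg.inv(L))
print("x:", x)
print("y:", y)
```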
First we find an expression for the joint distribution over x and y. To do this, we define

z = \begin{pmatrix} x \\ y \end{pmatrix}    (2.101)

and then consider the log of the joint distribution

\ln p(z) = \ln p(x) + \ln p(y|x)
         = -\frac{1}{2} (x - \mu)^T \Lambda (x - \mu) - \frac{1}{2} (y - Ax - b)^T L (y - Ax - b) + \text{const}    (2.102)

where 'const' denotes terms independent of x and y.
As before, we see that this is a quadratic function of the components of z, and hence p(z) is a Gaussian distribution. To find the precision of this Gaussian, we consider the second-order terms in (2.102), which can be written as

-\frac{1}{2} x^T (\Lambda + A^T L A) x - \frac{1}{2} y^T L y + \frac{1}{2} y^T L A x + \frac{1}{2} x^T A^T L y
    = -\frac{1}{2} \begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Lambda + A^T L A & -A^T L \\ -LA & L \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = -\frac{1}{2} z^T R z    (2.103)

and so the Gaussian distribution over z has precision (inverse covariance) matrix given by

R = \begin{pmatrix} \Lambda + A^T L A & -A^T L \\ -LA & L \end{pmatrix}.    (2.104)

The covariance matrix is found by taking the inverse of the precision, which can be done using the matrix inversion formula (2.76) to give (Exercise 2.29)

cov[z] = R^{-1} = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1} A^T \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^T \end{pmatrix}.    (2.105)
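The block structure of (2.104) and (2.105) is easy to verify numerically. The following sketch (using illustrative parameters of my own, as before) builds R, inverts it, and compares the result with the claimed partitioned covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 3, 2
Lam = 2.0 * np.eye(M)                  # precision of p(x)
A = rng.normal(size=(D, M))
L = 4.0 * np.eye(D)                    # precision of p(y|x)

# Precision of z = (x, y) from (2.104).
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A,            L       ]])

# Claimed partitioned covariance from (2.105).
Lam_inv = np.linalg.inv(Lam)
cov_z = np.block([[Lam_inv,     Lam_inv @ A.T],
                  [A @ Lam_inv, np.linalg.inv(L) + A @ Lam_inv @ A.T]])

print(np.allclose(np.linalg.inv(R), cov_z))   # True
```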
Similarly, we can find the mean of the Gaussian distribution over z by identifying the linear terms in (2.102), which are given by

x^T \Lambda \mu - x^T A^T L b + y^T L b = \begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Lambda \mu - A^T L b \\ L b \end{pmatrix}.    (2.106)

Using our earlier result (2.71), obtained by completing the square over the quadratic form of a multivariate Gaussian, we find that the mean of z is given by

E[z] = R^{-1} \begin{pmatrix} \Lambda \mu - A^T L b \\ L b \end{pmatrix}.    (2.107)

Making use of (2.105), we then obtain (Exercise 2.30)

E[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}.    (2.108)

Next we find an expression for the marginal distribution p(y) in which we have marginalized over x. Recall that the marginal distribution over a subset of the components of a Gaussian random vector takes a particularly simple form when expressed in terms of the partitioned covariance matrix (Section 2.3). Specifically, its mean and covariance are given by (2.92) and (2.93), respectively.
Making use of (2.105) and (2.108), we see that the mean and covariance of the marginal distribution p(y) are given by

E[y] = A\mu + b    (2.109)
cov[y] = L^{-1} + A \Lambda^{-1} A^T.    (2.110)

A special case of this result is when A = I, in which case it reduces to the convolution of two Gaussians, for which we see that the mean of the convolution is the sum of the means of the two Gaussians, and the covariance of the convolution is the sum of their covariances.
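The convolution special case is simple to check by simulation. In this sketch (parameter values and sample size are my own illustrative choices), y is generated as x plus independent Gaussian noise, and the sample mean and covariance of y are compared with (2.109) and (2.110).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000
mu = np.array([1.0, -1.0])
Sigma_x = np.array([[0.5, 0.2],
                    [0.2, 0.5]])          # Lambda^{-1}, covariance of p(x)
b = np.array([0.3, 0.3])
Sigma_n = np.array([[0.4, 0.0],
                    [0.0, 0.4]])          # L^{-1}, covariance of p(y|x)

# With A = I: y = x + b + noise, the convolution of two Gaussians.
x = rng.multivariate_normal(mu, Sigma_x, size=N)
y = x + b + rng.multivariate_normal(np.zeros(2), Sigma_n, size=N)

print(y.mean(axis=0))   # ~ mu + b: sum of the means                (2.109)
print(np.cov(y.T))      # ~ Sigma_x + Sigma_n: sum of covariances   (2.110)
```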
Finally, we seek an expression for the conditional p(x|y). Recall that the results for the conditional distribution are most easily expressed in terms of the partitioned precision matrix, using (2.73) and (2.75). Applying these results to (2.105) and (2.108), we see that the conditional distribution p(x|y) has mean and covariance given by

E[x|y] = (\Lambda + A^T L A)^{-1} \{ A^T L (y - b) + \Lambda \mu \}    (2.111)
cov[x|y] = (\Lambda + A^T L A)^{-1}.    (2.112)

The evaluation of this conditional can be seen as an example of Bayes' theorem. We can interpret the distribution p(x) as a prior distribution over x.
If the variable y is observed, then the conditional distribution p(x|y) represents the corresponding posterior distribution over x. Having found the marginal and conditional distributions, we have effectively expressed the joint distribution p(z) = p(x)p(y|x) in the form p(x|y)p(y). These results are summarized below.

Marginal and Conditional Gaussians

Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x in the form

p(x) = N(x|\mu, \Lambda^{-1})    (2.113)
p(y|x) = N(y|Ax + b, L^{-1})    (2.114)

the marginal distribution of y and the conditional distribution of x given y are given by

p(y) = N(y|A\mu + b, L^{-1} + A \Lambda^{-1} A^T)    (2.115)
p(x|y) = N(x|\Sigma\{A^T L (y - b) + \Lambda \mu\}, \Sigma)    (2.116)

where

\Sigma = (\Lambda + A^T L A)^{-1}.    (2.117)
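As a worked use of the boxed results, the sketch below (all parameter values and the observed y are hypothetical, chosen by me for illustration) computes the posterior p(x|y) for a single observed y from (2.116) and (2.117).

```python
import numpy as np

rng = np.random.default_rng(3)
M, D = 3, 2
mu = np.zeros(M)
Lam = 2.0 * np.eye(M)                  # prior precision of p(x)
A = rng.normal(size=(D, M))            # D x M
b = np.zeros(D)
L = 4.0 * np.eye(D)                    # noise precision of p(y|x)

y_obs = np.array([0.5, -0.2])          # hypothetical observation

# Posterior covariance and mean from (2.117) and (2.116).
Sigma_post = np.linalg.inv(Lam + A.T @ L @ A)
mu_post = Sigma_post @ (A.T @ L @ (y_obs - b) + Lam @ mu)
print(mu_post)
print(Sigma_post)
```

In the scalar case with A = I and b = 0, these expressions reduce to the familiar precision-weighted average of the prior mean and the observation.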
2.3.4 Maximum likelihood for the Gaussian

Given a data set X = (x_1, ..., x_N)^T in which the observations {x_n} are assumed to be drawn independently from a multivariate Gaussian distribution, we can estimate the parameters of the distribution by maximum likelihood. The log likelihood function is given by

\ln p(X|\mu, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu).    (2.118)

By simple rearrangement, we see that the likelihood function depends on the data set only through the two quantities

\sum_{n=1}^{N} x_n, \qquad \sum_{n=1}^{N} x_n x_n^T.    (2.119)

These are known as the sufficient statistics for the Gaussian distribution.
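The fact that the data enter only through (2.119) is easy to see in code. This sketch (synthetic data; the helper name gauss_log_lik and all parameter values are mine, not the book's) evaluates the log likelihood (2.118) directly and accumulates the two sufficient statistics.

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 2, 1000
mu_true = np.array([1.0, 2.0])
Sigma_true = np.array([[1.0, 0.3],
                       [0.3, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=N)   # N x D data matrix

def gauss_log_lik(X, mu, Sigma):
    """Evaluate the log likelihood (2.118) for i.i.d. rows of X."""
    N, D = X.shape
    diff = X - mu
    quad = np.einsum('ni,ij,nj->', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (N * D * np.log(2 * np.pi)
                   + N * np.linalg.slogdet(Sigma)[1] + quad)

# The data enter the likelihood only through these two statistics (2.119).
s1 = X.sum(axis=0)              # sum_n x_n
s2 = X.T @ X                    # sum_n x_n x_n^T
print(gauss_log_lik(X, mu_true, Sigma_true))
print(s1)
print(s2)
```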