Thus we have

    E[x x^T] = µ µ^T + Σ.        (2.62)

For single random variables, we subtracted the mean before taking second moments in order to define a variance. Similarly, in the multivariate case it is again convenient to subtract off the mean, giving rise to the covariance of a random vector x defined by

    cov[x] = E[(x − E[x])(x − E[x])^T].        (2.63)

For the specific case of a Gaussian distribution, we can make use of E[x] = µ, together with the result (2.62), to give

    cov[x] = Σ.        (2.64)

Because the parameter matrix Σ governs the covariance of x under the Gaussian distribution, it is called the covariance matrix.

Although the Gaussian distribution (2.43) is widely used as a density model, it suffers from some significant limitations.
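As a quick numerical sanity check of (2.62) and (2.64), the following minimal sketch (assuming NumPy is available; the mean, covariance, and sample size are arbitrary choices, not taken from the text) estimates both quantities from samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Draw samples from N(x | mu, Sigma).
x = rng.multivariate_normal(mu, Sigma, size=200_000)

# Sample estimate of E[x x^T]; should approach mu mu^T + Sigma, as in (2.62).
second_moment = np.einsum('ni,nj->ij', x, x) / len(x)
print(np.allclose(second_moment, np.outer(mu, mu) + Sigma, atol=0.05))  # True

# Sample covariance; should approach Sigma, as in (2.64).
print(np.allclose(np.cov(x, rowvar=False), Sigma, atol=0.05))  # True
```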
Consider the number of free parameters in the distribution. A general symmetric covariance matrix Σ has D(D + 1)/2 independent parameters (Exercise 2.21), and there are another D independent parameters in µ, giving D(D + 3)/2 parameters in total. For large D, the total number of parameters therefore grows quadratically with D, and the computational task of manipulating and inverting large matrices can become prohibitive.

[Figure 2.8: three panels (a), (b), (c), each with axes x_1 and x_2. Caption: Contours of constant probability density for a Gaussian distribution in two dimensions in which the covariance matrix is (a) of general form, (b) diagonal, in which the elliptical contours are aligned with the coordinate axes, and (c) proportional to the identity matrix, in which the contours are concentric circles.]
One way to address this problem is to use restricted forms of the covariance matrix. If we consider covariance matrices that are diagonal, so that Σ = diag(σ_i^2), we then have a total of 2D independent parameters in the density model. The corresponding contours of constant density are axis-aligned ellipsoids. We could further restrict the covariance matrix to be proportional to the identity matrix, Σ = σ^2 I, known as an isotropic covariance, giving D + 1 independent parameters in the model and spherical surfaces of constant density.
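The three restricted forms, and the parameter counts just derived, can be written down directly; a minimal NumPy sketch with arbitrarily chosen example values:

```python
import numpy as np

D = 3

# General symmetric covariance: any symmetric positive-definite matrix.
Sigma_general = np.array([[2.0, 0.3, 0.1],
                          [0.3, 1.0, 0.2],
                          [0.1, 0.2, 0.5]])
# Diagonal covariance: Sigma = diag(sigma_i^2).
Sigma_diagonal = np.diag([0.5, 1.0, 2.0])
# Isotropic covariance: Sigma = sigma^2 I.
Sigma_isotropic = 1.5 * np.eye(D)

# Free parameters in each model, including the D parameters of the mean:
print(D * (D + 3) // 2)  # general:   D(D+3)/2 = 9
print(2 * D)             # diagonal:  2D       = 6
print(D + 1)             # isotropic: D + 1    = 4
```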
The three possibilities of general, diagonal, and isotropic covariance matrices are illustrated in Figure 2.8. Unfortunately, whereas such approaches limit the number of degrees of freedom in the distribution and make inversion of the covariance matrix a much faster operation, they also greatly restrict the form of the probability density and limit its ability to capture interesting correlations in the data.

A further limitation of the Gaussian distribution is that it is intrinsically unimodal (i.e., has a single maximum) and so is unable to provide a good approximation to multimodal distributions.
Thus the Gaussian distribution can be at once too flexible, in the sense of having too many parameters, and too limited in the range of distributions that it can adequately represent. We will see later that the introduction of latent variables, also called hidden variables or unobserved variables, allows both of these problems to be addressed. In particular, a rich family of multimodal distributions is obtained by introducing discrete latent variables, leading to mixtures of Gaussians, as discussed in Section 2.3.9. Similarly, the introduction of continuous latent variables, as described in Chapter 12, leads to models in which the number of free parameters can be controlled independently of the dimensionality D of the data space while still allowing the model to capture the dominant correlations in the data set. Indeed, these two approaches can be combined and further extended to derive a very rich set of hierarchical models that can be adapted to a broad range of practical applications.
For instance, the Gaussian version of the Markov random field (Section 8.3), which is widely used as a probabilistic model of images, is a Gaussian distribution over the joint space of pixel intensities but is rendered tractable through the imposition of considerable structure reflecting the spatial organization of the pixels.
Similarly, the linear dynamical system (Section 13.3), used to model time series data for applications such as tracking, is also a joint Gaussian distribution over a potentially large number of observed and latent variables, and again is tractable due to the structure imposed on the distribution. A powerful framework for expressing the form and properties of such complex distributions is that of probabilistic graphical models, which will form the subject of Chapter 8.

2.3.1 Conditional Gaussian distributions

An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.

Consider first the case of conditional distributions.
Suppose x is a D-dimensional vector with Gaussian distribution N(x|µ, Σ) and that we partition x into two disjoint subsets x_a and x_b. Without loss of generality, we can take x_a to form the first M components of x, with x_b comprising the remaining D − M components, so that

    x = [ x_a ]
        [ x_b ].        (2.65)

We also define corresponding partitions of the mean vector µ given by

    µ = [ µ_a ]
        [ µ_b ]        (2.66)

and of the covariance matrix Σ given by

    Σ = [ Σ_aa  Σ_ab ]
        [ Σ_ba  Σ_bb ].        (2.67)

Note that the symmetry Σ^T = Σ of the covariance matrix implies that Σ_aa and Σ_bb are symmetric, while Σ_ba = Σ_ab^T.

In many situations, it will be convenient to work with the inverse of the covariance matrix

    Λ ≡ Σ^{-1}        (2.68)

which is known as the precision matrix. In fact, we shall see that some properties of Gaussian distributions are most naturally expressed in terms of the covariance, whereas others take a simpler form when viewed in terms of the precision.
We therefore also introduce the partitioned form of the precision matrix

    Λ = [ Λ_aa  Λ_ab ]
        [ Λ_ba  Λ_bb ]        (2.69)

corresponding to the partitioning (2.65) of the vector x. Because the inverse of a symmetric matrix is also symmetric (Exercise 2.22), we see that Λ_aa and Λ_bb are symmetric, while Λ_ab^T = Λ_ba. It should be stressed at this point that, for instance, Λ_aa is not simply given by the inverse of Σ_aa.
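This warning is easy to confirm numerically; a minimal sketch (the positive-definite Σ and the partition size M = 1 are invented for illustration):

```python
import numpy as np

# An arbitrary symmetric positive-definite covariance matrix.
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
M = 1  # x_a holds the first component, x_b the remaining two

Lambda = np.linalg.inv(Sigma)  # precision matrix, as in (2.68)

# The top-left block of the precision is NOT the inverse of the
# top-left block of the covariance.
print(np.allclose(Lambda[:M, :M], np.linalg.inv(Sigma[:M, :M])))  # False
```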
In fact, we shall shortly examine the relation between the inverse of a partitioned matrix and the inverses of its partitions.

Let us begin by finding an expression for the conditional distribution p(x_a|x_b). From the product rule of probability, we see that this conditional distribution can be evaluated from the joint distribution p(x) = p(x_a, x_b) simply by fixing x_b to the observed value and normalizing the resulting expression to obtain a valid probability distribution over x_a. Instead of performing this normalization explicitly, we can obtain the solution more efficiently by considering the quadratic form in the exponent of the Gaussian distribution given by (2.44) and then reinstating the normalization coefficient at the end of the calculation.
If we make use of the partitioning (2.65), (2.66), and (2.69), we obtain

    −(1/2)(x − µ)^T Σ^{-1} (x − µ)
        = −(1/2)(x_a − µ_a)^T Λ_aa (x_a − µ_a) − (1/2)(x_a − µ_a)^T Λ_ab (x_b − µ_b)
          − (1/2)(x_b − µ_b)^T Λ_ba (x_a − µ_a) − (1/2)(x_b − µ_b)^T Λ_bb (x_b − µ_b).        (2.70)

We see that as a function of x_a, this is again a quadratic form, and hence the corresponding conditional distribution p(x_a|x_b) will be Gaussian. Because this distribution is completely characterized by its mean and its covariance, our goal will be to identify expressions for the mean and covariance of p(x_a|x_b) by inspection of (2.70).
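As a numerical check of the expansion (2.70), the following sketch (arbitrary example values, assuming NumPy) compares the full exponent with the sum of the four partitioned terms:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.5, -1.0, 2.0])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
Lambda = np.linalg.inv(Sigma)
M = 1  # partition: x_a = x[:M], x_b = x[M:]

x = rng.standard_normal(3)  # an arbitrary evaluation point
d = x - mu
da, db = d[:M], d[M:]
Laa, Lab = Lambda[:M, :M], Lambda[:M, M:]
Lba, Lbb = Lambda[M:, :M], Lambda[M:, M:]

# Full exponent on the left-hand side of (2.70) ...
lhs = -0.5 * d @ Lambda @ d
# ... equals the sum of the four partitioned terms on the right.
rhs = (-0.5 * da @ Laa @ da - 0.5 * da @ Lab @ db
       - 0.5 * db @ Lba @ da - 0.5 * db @ Lbb @ db)
print(np.allclose(lhs, rhs))  # True
```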
This is an example of a rather common operation associated with Gaussian distributions, sometimes called 'completing the square', in which we are given a quadratic form defining the exponent terms in a Gaussian distribution, and we need to determine the corresponding mean and covariance. Such problems can be solved straightforwardly by noting that the exponent in a general Gaussian distribution N(x|µ, Σ) can be written

    −(1/2)(x − µ)^T Σ^{-1} (x − µ) = −(1/2) x^T Σ^{-1} x + x^T Σ^{-1} µ + const        (2.71)

where 'const' denotes terms which are independent of x, and we have made use of the symmetry of Σ. Thus if we take our general quadratic form and express it in the form given by the right-hand side of (2.71), then we can immediately equate the matrix of coefficients entering the second-order term in x to the inverse covariance matrix Σ^{-1} and the coefficient of the linear term in x to Σ^{-1}µ, from which we can obtain µ.
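A minimal sketch of this recipe (the coefficient matrix A and vector b are arbitrary stand-ins for whatever quadratic form one has been given, not quantities from the text):

```python
import numpy as np

# Suppose an exponent has been arranged into the form of the right-hand
# side of (2.71):  -1/2 x^T A x + x^T b + const.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])  # second-order coefficient matrix (symmetric, PD)
b = np.array([1.0, -1.0])   # linear coefficient vector

Sigma = np.linalg.inv(A)  # equate A with the inverse covariance Sigma^{-1}
mu = Sigma @ b            # equate b with Sigma^{-1} mu, hence mu = Sigma b
print(Sigma)
print(mu)
```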
Now let us apply this procedure to the conditional Gaussian distribution p(x_a|x_b) for which the quadratic form in the exponent is given by (2.70). We will denote the mean and covariance of this distribution by µ_{a|b} and Σ_{a|b}, respectively. Consider the functional dependence of (2.70) on x_a in which x_b is regarded as a constant. If we pick out all terms that are second order in x_a, we have

    −(1/2) x_a^T Λ_aa x_a        (2.72)

from which we can immediately conclude that the covariance (inverse precision) of p(x_a|x_b) is given by

    Σ_{a|b} = Λ_aa^{-1}.        (2.73)

Now consider all of the terms in (2.70) that are linear in x_a

    x_a^T {Λ_aa µ_a − Λ_ab (x_b − µ_b)}        (2.74)

where we have used Λ_ba^T = Λ_ab. From our discussion of the general form (2.71), the coefficient of x_a in this expression must equal Σ_{a|b}^{-1} µ_{a|b} and hence

    µ_{a|b} = Σ_{a|b} {Λ_aa µ_a − Λ_ab (x_b − µ_b)}
            = µ_a − Λ_aa^{-1} Λ_ab (x_b − µ_b)        (2.75)

where we have made use of (2.73).

The results (2.73) and (2.75) are expressed in terms of the partitioned precision matrix of the original joint distribution p(x_a, x_b).
We can also express these results in terms of the corresponding partitioned covariance matrix. To do this, we make use of the following identity for the inverse of a partitioned matrix (Exercise 2.24)

    [ A  B ]^{-1}   [ M             −M B D^{-1}                  ]
    [ C  D ]      = [ −D^{-1} C M   D^{-1} + D^{-1} C M B D^{-1} ]        (2.76)

where we have defined

    M = (A − B D^{-1} C)^{-1}.        (2.77)

The quantity M^{-1} is known as the Schur complement of the matrix on the left-hand side of (2.76) with respect to the submatrix D. Using the definition

    [ Σ_aa  Σ_ab ]^{-1}   [ Λ_aa  Λ_ab ]
    [ Σ_ba  Σ_bb ]      = [ Λ_ba  Λ_bb ]        (2.78)

and making use of (2.76), we have

    Λ_aa = (Σ_aa − Σ_ab Σ_bb^{-1} Σ_ba)^{-1}        (2.79)

    Λ_ab = −(Σ_aa − Σ_ab Σ_bb^{-1} Σ_ba)^{-1} Σ_ab Σ_bb^{-1}.        (2.80)

From these we obtain the following expressions for the mean and covariance of the conditional distribution p(x_a|x_b)

    µ_{a|b} = µ_a + Σ_ab Σ_bb^{-1} (x_b − µ_b)        (2.81)

    Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^{-1} Σ_ba.        (2.82)

Comparing (2.73) and (2.82), we see that the conditional distribution p(x_a|x_b) takes a simpler form when expressed in terms of the partitioned precision matrix than when it is expressed in terms of the partitioned covariance matrix.
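The two sets of results can be cross-checked numerically; a minimal sketch (same style of arbitrary example values as above) verifying that the precision-based expressions (2.73) and (2.75) agree with the covariance-based expressions (2.81) and (2.82):

```python
import numpy as np

# Arbitrary example values, invented for illustration.
mu = np.array([0.5, -1.0, 2.0])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
M = 1
xb = np.array([0.2, 1.1])  # an arbitrary observed value of x_b

mu_a, mu_b = mu[:M], mu[M:]
Saa, Sab = Sigma[:M, :M], Sigma[:M, M:]
Sba, Sbb = Sigma[M:, :M], Sigma[M:, M:]
Lambda = np.linalg.inv(Sigma)
Laa, Lab = Lambda[:M, :M], Lambda[:M, M:]

# Precision form, (2.73) and (2.75).
Sigma_ab_prec = np.linalg.inv(Laa)
mu_ab_prec = mu_a - np.linalg.inv(Laa) @ Lab @ (xb - mu_b)

# Covariance form, (2.81) and (2.82).
Sigma_ab_cov = Saa - Sab @ np.linalg.inv(Sbb) @ Sba
mu_ab_cov = mu_a + Sab @ np.linalg.inv(Sbb) @ (xb - mu_b)

print(np.allclose(Sigma_ab_prec, Sigma_ab_cov))  # True
print(np.allclose(mu_ab_prec, mu_ab_cov))        # True
```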
Note that the mean of the conditional distribution p(x_a|x_b), given by (2.81), is a linear function of x_b and that the covariance, given by (2.82), is independent of x_a. This represents an example of a linear-Gaussian model (Section 8.1.4).

2.3.2 Marginal Gaussian distributions

We have seen that if a joint distribution p(x_a, x_b) is Gaussian, then the conditional distribution p(x_a|x_b) will again be Gaussian. Now we turn to a discussion of the marginal distribution given by

    p(x_a) = ∫ p(x_a, x_b) dx_b        (2.83)

which, as we shall see, is also Gaussian.
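Anticipating the result derived in the remainder of this subsection, namely that this marginal is the standard N(x_a | µ_a, Σ_aa), a sampling-based sketch (arbitrary example values, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.5, -1.0, 2.0])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
M = 1

# Sampling from the joint and then discarding x_b marginalizes it out.
xa = rng.multivariate_normal(mu, Sigma, size=200_000)[:, :M]

# Empirically, x_a behaves as N(x_a | mu_a, Sigma_aa).
print(np.allclose(xa.mean(axis=0), mu[:M], atol=0.05))                  # True
print(np.allclose(np.cov(xa, rowvar=False), Sigma[:M, :M], atol=0.05))  # True
```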