Bishop C.M., Pattern Recognition and Machine Learning (2006)
[Continuation of a figure caption from the preceding page: {α_k} = 0.1 in the left plot, {α_k} = 1 in the centre plot, and {α_k} = 10 in the right plot.]

Two-state quantities can either be represented as binary variables and modelled using the binomial distribution (2.9) or as 1-of-2 variables and modelled using the multinomial distribution (2.34) with K = 2.

2.3. The Gaussian Distribution

The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian distribution can be written in the form

N(x | \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\}    (2.42)

where µ is the mean and σ² is the variance. For a D-dimensional vector x, the multivariate Gaussian distribution takes the form

N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}    (2.43)

where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ.
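As a brief numerical aside (an addition to the text, not part of the book), the following Python sketch evaluates (2.42) and (2.43) directly and checks them against scipy.stats. It assumes NumPy and SciPy are available, and the particular values of µ, σ² and Σ are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate Gaussian density, equation (2.42); sigma2 is the variance
def gauss_1d(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Multivariate Gaussian density, equation (2.43)
def gauss_nd(x, mu, Sigma):
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.0])

# Each pair of printed values should agree
print(gauss_1d(0.3, 0.0, 2.0), norm.pdf(0.3, loc=0.0, scale=np.sqrt(2.0)))
print(gauss_nd(x, mu, Sigma), multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```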
The Gaussian distribution arises in many different contexts and can be motivated from a variety of different perspectives. For example, we have already seen that for a single real variable, the distribution that maximizes the entropy is the Gaussian (Section 1.6). This property applies also to the multivariate Gaussian (Exercise 2.14).

Another situation in which the Gaussian distribution arises is when we consider the sum of multiple random variables. The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases (Walker, 1969).
[Figure 2.6: Histogram plots of the mean of N uniformly distributed numbers, shown for N = 1, N = 2, and N = 10. We observe that as N increases, the distribution tends towards a Gaussian.]

We can illustrate this by considering N variables x_1, . . . , x_N, each of which has a uniform distribution over the interval [0, 1], and then considering the distribution of the mean (x_1 + · · · + x_N)/N. For large N, this distribution tends to a Gaussian, as illustrated in Figure 2.6.
In practice, the convergence to a Gaussian as N increases can be very rapid. One consequence of this result is that the binomial distribution (2.9), which is a distribution over m defined by the sum of N observations of the random binary variable x, will tend to a Gaussian as N → ∞ (see Figure 2.1 for the case of N = 10).
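A small simulation in the spirit of Figure 2.6 (again an addition, assuming NumPy; the sample size is arbitrary) draws the mean of N uniform variables on [0, 1] and compares its sample moments with the exact mean 1/2 and variance 1/(12N) of the averaged variable.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 100_000

for N in (1, 2, 10):
    # Mean of N uniform [0, 1] variables, repeated num_samples times
    means = rng.uniform(0.0, 1.0, size=(num_samples, N)).mean(axis=1)
    print(f"N={N:2d}  sample mean={means.mean():.4f}  "
          f"sample var={means.var():.5f}  exact var={1/(12*N):.5f}")
```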
The Gaussian distribution has many important analytical properties, and we shall consider several of these in detail. As a result, this section will be rather more technically involved than some of the earlier sections, and will require familiarity with various matrix identities (see Appendix C). However, we strongly encourage the reader to become proficient in manipulating Gaussian distributions using the techniques presented here, as this will prove invaluable in understanding the more complex models presented in later chapters.

We begin by considering the geometrical form of the Gaussian distribution.

[Biographical note: Carl Friedrich Gauss, 1777–1855] It is said that when Gauss went to elementary school at age 7, his teacher Büttner, trying to keep the class occupied, asked the pupils to sum the integers from 1 to 100.
To the teacher's amazement, Gauss arrived at the answer in a matter of moments by noting that the sum can be represented as 50 pairs (1 + 100, 2 + 99, etc.), each of which added to 101, giving the answer 5,050. It is now believed that the problem which was actually set was of the same form but somewhat harder, in that the sequence had a larger starting value and a larger increment. Gauss was a German mathematician and scientist with a reputation for being a hard-working perfectionist. One of his many contributions was to show that least squares can be derived under the assumption of normally distributed errors. He also created an early formulation of non-Euclidean geometry (a self-consistent geometrical theory that violates the axioms of Euclid) but was reluctant to discuss it openly for fear that his reputation might suffer if it were seen that he believed in such a geometry. At one point, Gauss was asked to conduct a geodetic survey of the state of Hanover, which led to his formulation of the normal distribution, now also known as the Gaussian.
After his death, a study of his diaries revealed that he had discovered several important mathematical results years or even decades before they were published by others.

The functional dependence of the Gaussian on x is through the quadratic form

\Delta^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)    (2.44)

which appears in the exponent.
The quantity ∆ is called the Mahalanobis distance from µ to x, and it reduces to the Euclidean distance when Σ is the identity matrix. The Gaussian distribution will be constant on surfaces in x-space for which this quadratic form is constant.
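As a short illustration (an addition, assuming NumPy; the numbers are arbitrary), the Mahalanobis distance defined by (2.44) can be computed directly, and setting Σ = I recovers the Euclidean distance.

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    diff = x - mu
    # Delta^2 = (x - mu)^T Sigma^{-1} (x - mu), equation (2.44)
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

x = np.array([2.0, 1.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])

print(mahalanobis(x, mu, Sigma))        # distance under Sigma
print(mahalanobis(x, mu, np.eye(2)))    # Sigma = I: reduces to Euclidean distance
print(np.linalg.norm(x - mu))           # matches the line above
```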
First of all, we note that the matrix Σ can be taken to be symmetric, without loss of generality, because any antisymmetric component would disappear from the exponent (Exercise 2.17). Now consider the eigenvector equation for the covariance matrix

\Sigma u_i = \lambda_i u_i    (2.45)

where i = 1, . . . , D. Because Σ is a real, symmetric matrix, its eigenvalues will be real, and its eigenvectors can be chosen to form an orthonormal set (Exercise 2.18), so that

u_i^T u_j = I_{ij}    (2.46)

where I_{ij} is the i, j element of the identity matrix and satisfies

I_{ij} = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases}    (2.47)

The covariance matrix Σ can be expressed as an expansion in terms of its eigenvectors in the form (Exercise 2.19)

\Sigma = \sum_{i=1}^{D} \lambda_i u_i u_i^T    (2.48)

and similarly the inverse covariance matrix Σ^{-1} can be expressed as

\Sigma^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i} u_i u_i^T.    (2.49)

Substituting (2.49) into (2.44), the quadratic form becomes

\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}    (2.50)

where we have defined

y_i = u_i^T (x - \mu).    (2.51)

We can interpret {y_i} as a new coordinate system defined by the orthonormal vectors u_i that are shifted and rotated with respect to the original x_i coordinates.
Forming the vector y = (y_1, . . . , y_D)^T, we have

y = U(x - \mu)    (2.52)

where U is a matrix whose rows are given by u_i^T.

[Figure 2.7: The red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space x = (x_1, x_2), on which the density is exp(−1/2) of its value at x = µ. The major axes of the ellipse are defined by the eigenvectors u_i of the covariance matrix, with corresponding eigenvalues λ_i; the axis half-lengths are λ_1^{1/2} and λ_2^{1/2}.]
From (2.46) it follows that U is an orthogonal matrix (see Appendix C), i.e., it satisfies U U^T = I, and hence also U^T U = I, where I is the identity matrix.

The quadratic form, and hence the Gaussian density, will be constant on surfaces for which (2.51) is constant. If all of the eigenvalues λ_i are positive, then these surfaces represent ellipsoids, with their centres at µ and their axes oriented along u_i, and with scaling factors in the directions of the axes given by λ_i^{1/2}, as illustrated in Figure 2.7.
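The change of coordinates in (2.51) and (2.52) is easy to check numerically. The sketch below (an addition, assuming NumPy; the 2 × 2 covariance is an arbitrary example) builds U from the eigenvectors of Σ and confirms that the quadratic form (2.44) equals the rotated form (2.50), with the axis scalings λ_i^{1/2} of Figure 2.7.

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

# Eigendecomposition of the symmetric covariance: Sigma u_i = lambda_i u_i, (2.45)
lam, eigvecs = np.linalg.eigh(Sigma)   # columns of eigvecs are the u_i
U = eigvecs.T                          # rows of U are u_i^T, as in (2.52)

y = U @ (x - mu)                       # y = U (x - mu), equation (2.52)

# Quadratic form computed two ways: directly via (2.44) and in rotated coordinates via (2.50)
delta2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
delta2_rotated = np.sum(y ** 2 / lam)
print(delta2_direct, delta2_rotated)   # the two values agree

# Axis half-lengths of the constant-density ellipse scale as lambda_i^{1/2}
print(np.sqrt(lam))
```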
For the Gaussian distribution to be well defined, it is necessary for all of the eigenvalues λ_i of the covariance matrix to be strictly positive, otherwise the distribution cannot be properly normalized. A matrix whose eigenvalues are strictly positive is said to be positive definite. In Chapter 12, we will encounter Gaussian distributions for which one or more of the eigenvalues are zero, in which case the distribution is singular and is confined to a subspace of lower dimensionality. If all of the eigenvalues are nonnegative, then the covariance matrix is said to be positive semidefinite.

Now consider the form of the Gaussian distribution in the new coordinate system defined by the y_i.
In going from the x to the y coordinate system, we have a Jacobian matrix J with elements given by

J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ji}    (2.53)

where U_{ji} are the elements of the matrix U^T. Using the orthonormality property of the matrix U, we see that the square of the determinant of the Jacobian matrix is

|J|^2 = |U^T|^2 = |U^T| |U| = |U^T U| = |I| = 1    (2.54)

and hence |J| = 1. Also, the determinant |Σ| of the covariance matrix can be written as the product of its eigenvalues, and hence

|\Sigma|^{1/2} = \prod_{j=1}^{D} \lambda_j^{1/2}.    (2.55)

Thus in the y_j coordinate system, the Gaussian distribution takes the form

p(y) = p(x) |J| = \prod_{j=1}^{D} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left\{ -\frac{y_j^2}{2\lambda_j} \right\}    (2.56)

which is the product of D independent univariate Gaussian distributions. The eigenvectors therefore define a new set of shifted and rotated coordinates with respect to which the joint probability distribution factorizes into a product of independent distributions.
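The determinant identity (2.55) and the factorized density (2.56) can likewise be verified numerically. The following sketch (an addition, assuming NumPy and SciPy, with arbitrary values) compares the joint density p(x) with the product of univariate Gaussians evaluated at y = U(x − µ).

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

lam, eigvecs = np.linalg.eigh(Sigma)
U = eigvecs.T
y = U @ (x - mu)

# |Sigma| equals the product of the eigenvalues, equation (2.55)
print(np.linalg.det(Sigma), np.prod(lam))

# Since |J| = 1, p(x) equals the product of univariate Gaussians in y, equation (2.56)
p_joint = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
p_factorized = np.prod(norm.pdf(y, loc=0.0, scale=np.sqrt(lam)))
print(p_joint, p_factorized)
```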
The integral of the distribution in the y coordinate system is then

\int p(y) \, dy = \prod_{j=1}^{D} \int_{-\infty}^{\infty} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left\{ -\frac{y_j^2}{2\lambda_j} \right\} dy_j = 1    (2.57)

where we have used the result (1.48) for the normalization of the univariate Gaussian. This confirms that the multivariate Gaussian (2.43) is indeed normalized.

We now look at the moments of the Gaussian distribution and thereby provide an interpretation of the parameters µ and Σ. The expectation of x under the Gaussian distribution is given by

E[x] = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \int \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\} x \, dx
     = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \int \exp\left\{ -\frac{1}{2} z^T \Sigma^{-1} z \right\} (z + \mu) \, dz    (2.58)

where we have changed variables using z = x − µ.
We now note that the exponent is an even function of the components of z and, because the integrals over these are taken over the range (−∞, ∞), the term in z in the factor (z + µ) will vanish by symmetry. Thus

E[x] = \mu    (2.59)

and so we refer to µ as the mean of the Gaussian distribution.

We now consider second-order moments of the Gaussian. In the univariate case, we considered the second-order moment given by E[x²]. For the multivariate Gaussian, there are D² second-order moments given by E[x_i x_j], which we can group together to form the matrix E[x x^T]. This matrix can be written as

E[x x^T] = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \int \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\} x x^T \, dx
         = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \int \exp\left\{ -\frac{1}{2} z^T \Sigma^{-1} z \right\} (z + \mu)(z + \mu)^T \, dz

where again we have changed variables using z = x − µ.
Note that the cross-terms involving µz^T and zµ^T will again vanish by symmetry. The term µµ^T is constant and can be taken outside the integral, which itself is unity because the Gaussian distribution is normalized. Consider the term involving zz^T. Again, we can make use of the eigenvector expansion of the covariance matrix given by (2.45), together with the completeness of the set of eigenvectors, to write

z = \sum_{j=1}^{D} y_j u_j    (2.60)

where y_j = u_j^T z, which gives

\frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \int \exp\left\{ -\frac{1}{2} z^T \Sigma^{-1} z \right\} z z^T \, dz
= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \sum_{i=1}^{D} \sum_{j=1}^{D} u_i u_j^T \int \exp\left\{ -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right\} y_i y_j \, dy
= \sum_{i=1}^{D} u_i u_i^T \lambda_i = \Sigma    (2.61)

where we have made use of the eigenvector equation (2.45), together with the fact that the integral on the right-hand side of the middle line vanishes by symmetry unless i = j, and in the final line we have made use of the results (1.50) and (2.55), together with (2.48).
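The moment results can be checked by sampling. The sketch below (an addition, assuming NumPy; the sample size and parameter values are arbitrary) confirms that the sample mean approaches µ as in (2.59) and that the average of zz^T with z = x − µ approaches Σ as in (2.61), so that E[xx^T] approaches µµ^T + Σ.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -1.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

samples = rng.multivariate_normal(mu, Sigma, size=200_000)

# First-order moment: E[x] = mu, equation (2.59)
print(samples.mean(axis=0))            # close to mu

# The zz^T term with z = x - mu integrates to Sigma, as in (2.61)
z = samples - mu
print(z.T @ z / len(z))                # close to Sigma

# Second-order moment matrix E[x x^T] = mu mu^T + Sigma
print(samples.T @ samples / len(samples))
print(np.outer(mu, mu) + Sigma)
```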