Bishop, C.M. Pattern Recognition and Machine Learning (2006) — excerpt from Section 2.3, The Gaussian Distribution
In fact, the Bayesian paradigm leads very naturally to a sequential view of the inference problem. To see this in the context of the inference of the mean of a Gaussian, we write the posterior distribution with the contribution from the final data point x_N separated out, so that

    p(µ|D) ∝ [ p(µ) ∏_{n=1}^{N−1} p(x_n|µ) ] p(x_N|µ).    (2.144)

The term in square brackets is (up to a normalization coefficient) just the posterior distribution after observing N − 1 data points. We see that this can be viewed as a prior distribution, which is combined using Bayes' theorem with the likelihood function associated with data point x_N to arrive at the posterior distribution after observing N data points. This sequential view of Bayesian inference is very general and applies to any problem in which the observed data are assumed to be independent and identically distributed.

So far, we have assumed that the variance of the Gaussian distribution over the data is known and our goal is to infer the mean.
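The update in (2.144) is easy to run in practice. Below is a minimal sketch (not from the text; the function name, seed, and parameter values are illustrative) of sequential inference of a Gaussian mean with known variance, using the standard conjugate update in which precisions add and the posterior mean is precision-weighted:

```python
import numpy as np

def update_gaussian_mean(mu_prior, var_prior, x, var_known):
    """Absorb one data point x ~ N(mu, var_known) into the prior
    N(mu | mu_prior, var_prior), returning the posterior mean and variance."""
    prec_post = 1.0 / var_prior + 1.0 / var_known                 # precisions add
    var_post = 1.0 / prec_post
    mu_post = var_post * (mu_prior / var_prior + x / var_known)   # precision-weighted mean
    return mu_post, var_post

# Process the data one point at a time, as in (2.144).
rng = np.random.default_rng(0)
mu, var = 0.0, 10.0                        # broad prior over the mean
for x in rng.normal(1.5, 1.0, size=20):    # true mean 1.5, known unit variance
    mu, var = update_gaussian_mean(mu, var, x, var_known=1.0)
```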
Now let us suppose that the mean is known and we wish to infer the variance. Again, our calculations will be greatly simplified if we choose a conjugate form for the prior distribution. It turns out to be most convenient to work with the precision λ ≡ 1/σ². The likelihood function for λ takes the form

    p(X|λ) = ∏_{n=1}^N N(x_n|µ, λ^{−1}) ∝ λ^{N/2} exp{ −(λ/2) ∑_{n=1}^N (x_n − µ)² }.    (2.145)

[Figure 2.13: Plots of the gamma distribution Gam(λ|a, b) defined by (2.146) for various values of the parameters a and b: (a = 0.1, b = 0.1), (a = 1, b = 1), (a = 4, b = 6).]

The corresponding conjugate prior should therefore be proportional to the product of a power of λ and the exponential of a linear function of λ. This corresponds to the gamma distribution, which is defined by

    Gam(λ|a, b) = (1/Γ(a)) b^a λ^{a−1} exp(−bλ).    (2.146)

Here Γ(a) is the gamma function that is defined by (1.141) and that ensures that (2.146) is correctly normalized (Exercise 2.41).
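As a quick numerical check (not from the text), the density (2.146) can be evaluated with scipy, whose gamma distribution uses shape a and scale 1/b:

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import gamma as gamma_fn

a, b = 4.0, 6.0                        # one of the settings shown in Figure 2.13
lam = np.linspace(0.01, 2.0, 200)

# scipy parameterizes Gam(lam | a, b) with shape a and scale = 1/b.
pdf = gamma.pdf(lam, a, scale=1.0 / b)

# Direct evaluation of (2.146) agrees.
pdf_direct = b**a * lam**(a - 1) * np.exp(-b * lam) / gamma_fn(a)
assert np.allclose(pdf, pdf_direct)
```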
The gamma distribution has a finite integral if a > 0, and the distribution itself is finite if a ≥ 1. It is plotted, for various values of a and b, in Figure 2.13. The mean and variance of the gamma distribution are given by (Exercise 2.42)

    E[λ] = a/b    (2.147)
    var[λ] = a/b².    (2.148)

Consider a prior distribution Gam(λ|a_0, b_0). If we multiply by the likelihood function (2.145), then we obtain a posterior distribution

    p(λ|X) ∝ λ^{a_0−1} λ^{N/2} exp{ −b_0 λ − (λ/2) ∑_{n=1}^N (x_n − µ)² }    (2.149)

which we recognize as a gamma distribution of the form Gam(λ|a_N, b_N), where

    a_N = a_0 + N/2    (2.150)
    b_N = b_0 + (1/2) ∑_{n=1}^N (x_n − µ)² = b_0 + (N/2) σ²_ML    (2.151)

and σ²_ML is the maximum likelihood estimator of the variance.
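A minimal sketch of the posterior update (2.150)–(2.151), with an illustrative function name and data:

```python
import numpy as np

def gamma_posterior(a0, b0, x, mu):
    """Posterior Gam(lam | aN, bN) over the precision, given data x and
    known mean mu, following (2.150)-(2.151)."""
    x = np.asarray(x)
    aN = a0 + len(x) / 2.0
    bN = b0 + 0.5 * np.sum((x - mu) ** 2)
    return aN, bN

data = np.random.default_rng(0).normal(0.0, 2.0, size=50)
aN, bN = gamma_posterior(0.1, 0.1, data, mu=0.0)
precision_estimate = aN / bN   # posterior mean of lambda, from (2.147)
```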
Note that in (2.149) there is no need to keep track of the normalization constants in the prior and the likelihood function because, if required, the correct coefficient can be found at the end using the normalized form (2.146) for the gamma distribution.

From (2.150), we see that the effect of observing N data points is to increase the value of the coefficient a by N/2. Thus we can interpret the parameter a_0 in the prior in terms of 2a_0 'effective' prior observations.
Similarly, from (2.151) we see that the N data points contribute N σ²_ML/2 to the parameter b, where σ²_ML is the variance, and so we can interpret the parameter b_0 in the prior as arising from the 2a_0 'effective' prior observations having variance 2b_0/(2a_0) = b_0/a_0. Recall that we made an analogous interpretation for the Dirichlet prior (Section 2.2). These distributions are examples of the exponential family, and we shall see that the interpretation of a conjugate prior in terms of effective fictitious data points is a general one for the exponential family of distributions.

Instead of working with the precision, we can consider the variance itself.
The conjugate prior in this case is called the inverse gamma distribution, although we shall not discuss this further because we will find it more convenient to work with the precision.

Now suppose that both the mean and the precision are unknown. To find a conjugate prior, we consider the dependence of the likelihood function on µ and λ

    p(X|µ, λ) = ∏_{n=1}^N (λ/2π)^{1/2} exp{ −(λ/2)(x_n − µ)² }
              ∝ [ λ^{1/2} exp(−λµ²/2) ]^N exp{ λµ ∑_{n=1}^N x_n − (λ/2) ∑_{n=1}^N x_n² }.    (2.152)

We now wish to identify a prior distribution p(µ, λ) that has the same functional dependence on µ and λ as the likelihood function and that should therefore take the form

    p(µ, λ) ∝ [ λ^{1/2} exp(−λµ²/2) ]^β exp{ cλµ − dλ }
            = exp{ −(βλ/2)(µ − c/β)² } λ^{β/2} exp{ −(d − c²/2β) λ }    (2.153)

where c, d, and β are constants. Since we can always write p(µ, λ) = p(µ|λ)p(λ), we can find p(µ|λ) and p(λ) by inspection. In particular, we see that p(µ|λ) is a Gaussian whose precision is a linear function of λ and that p(λ) is a gamma distribution, so that the normalized prior takes the form

    p(µ, λ) = N(µ|µ_0, (βλ)^{−1}) Gam(λ|a, b)    (2.154)

where we have defined new constants given by µ_0 = c/β, a = 1 + β/2, b = d − c²/2β.
The distribution (2.154) is called the normal-gamma or Gaussian-gamma distribution and is plotted in Figure 2.14. Note that this is not simply the product of an independent Gaussian prior over µ and a gamma prior over λ, because the precision of µ is a linear function of λ. Even if we chose a prior in which µ and λ were independent, the posterior distribution would exhibit a coupling between the precision of µ and the value of λ.

[Figure 2.14: Contour plot of the normal-gamma distribution (2.154) for parameter values µ_0 = 0, β = 2, a = 5 and b = 6.]
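A sketch (function name and seed are illustrative) of ancestral sampling from (2.154), using the parameter values of Figure 2.14: draw λ from its gamma marginal, then µ from the conditional Gaussian.

```python
import numpy as np
from scipy.stats import gamma, norm

def sample_normal_gamma(mu0, beta, a, b, size, rng):
    """Draw (mu, lam) pairs from (2.154): lam ~ Gam(a, b),
    then mu | lam ~ N(mu0, (beta * lam)^-1)."""
    lam = gamma.rvs(a, scale=1.0 / b, size=size, random_state=rng)
    mu = norm.rvs(loc=mu0, scale=1.0 / np.sqrt(beta * lam),
                  size=size, random_state=rng)
    return mu, lam

# Parameter values from Figure 2.14.
mu, lam = sample_normal_gamma(0.0, 2.0, 5.0, 6.0, size=1000,
                              rng=np.random.default_rng(1))
```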
In the case of the multivariate Gaussian distribution N(x|µ, Λ^{−1}) for a D-dimensional variable x, the conjugate prior distribution for the mean µ, assuming the precision is known, is again a Gaussian. For known mean and unknown precision matrix Λ, the conjugate prior is the Wishart distribution (Exercise 2.45), given by

    W(Λ|W, ν) = B |Λ|^{(ν−D−1)/2} exp{ −(1/2) Tr(W^{−1}Λ) }    (2.155)

where ν is called the number of degrees of freedom of the distribution, W is a D × D scale matrix, and Tr(·) denotes the trace. The normalization constant B is given by

    B(W, ν) = |W|^{−ν/2} ( 2^{νD/2} π^{D(D−1)/4} ∏_{i=1}^D Γ((ν + 1 − i)/2) )^{−1}.    (2.156)

Again, it is also possible to define a conjugate prior over the covariance matrix itself, rather than over the precision matrix, which leads to the inverse Wishart distribution, although we shall not discuss this further.
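As an aside not in the text: scipy's Wishart distribution uses the same parameterization as (2.155), with df = ν and scale = W, so precision matrices can be drawn directly (the values below are illustrative):

```python
import numpy as np
from scipy.stats import wishart

D = 2
W = np.eye(D)      # D x D scale matrix
nu = 5.0           # degrees of freedom
Lam = wishart.rvs(df=nu, scale=W, random_state=np.random.default_rng(2))
# Under this parameterization, E[Lambda] = nu * W.
```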
If both the mean and the precision are unknown, then, following a similar line of reasoning to the univariate case, the conjugate prior is given by

    p(µ, Λ|µ_0, β, W, ν) = N(µ|µ_0, (βΛ)^{−1}) W(Λ|W, ν)    (2.157)

which is known as the normal-Wishart or Gaussian-Wishart distribution.

2.3.7 Student's t-distribution

We have seen (Section 2.3.6) that the conjugate prior for the precision of a Gaussian is given by a gamma distribution. If we have a univariate Gaussian N(x|µ, τ^{−1}) together with a gamma prior Gam(τ|a, b) and we integrate out the precision, we obtain the marginal distribution of x in the form (Exercise 2.46)
[Figure 2.15: Plot of Student's t-distribution (2.159) for µ = 0 and λ = 1 for various values of ν (ν = 0.1, ν = 1.0, ν → ∞). The limit ν → ∞ corresponds to a Gaussian distribution with mean µ and precision λ.]

    p(x|µ, a, b) = ∫_0^∞ N(x|µ, τ^{−1}) Gam(τ|a, b) dτ
                 = ∫_0^∞ (b^a e^{−bτ} τ^{a−1}/Γ(a)) (τ/2π)^{1/2} exp{ −(τ/2)(x − µ)² } dτ
                 = (b^a/Γ(a)) (1/2π)^{1/2} [ b + (x − µ)²/2 ]^{−a−1/2} Γ(a + 1/2)    (2.158)

where we have made the change of variable z = τ[b + (x − µ)²/2].
By convention we define new parameters given by ν = 2a and λ = a/b, in terms of which the distribution p(x|µ, a, b) takes the form

    St(x|µ, λ, ν) = [Γ(ν/2 + 1/2)/Γ(ν/2)] (λ/πν)^{1/2} [ 1 + λ(x − µ)²/ν ]^{−ν/2−1/2}    (2.159)

which is known as Student's t-distribution (Exercises 2.47, 12.24).
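As an unofficial numerical cross-check, (2.159) coincides with scipy's t-distribution with df = ν, loc = µ, and scale = λ^{−1/2}:

```python
import numpy as np
from scipy.stats import t
from scipy.special import gammaln

def st_pdf(x, mu, lam, nu):
    """Student density St(x | mu, lam, nu) evaluated directly from (2.159),
    computed in log space for numerical stability."""
    log_norm = (gammaln(nu / 2 + 0.5) - gammaln(nu / 2)
                + 0.5 * np.log(lam / (np.pi * nu)))
    return np.exp(log_norm) * (1 + lam * (x - mu) ** 2 / nu) ** (-nu / 2 - 0.5)

x = np.linspace(-5, 5, 101)
assert np.allclose(st_pdf(x, mu=0.3, lam=2.0, nu=4.0),
                   t.pdf(x, df=4.0, loc=0.3, scale=2.0 ** -0.5))
```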
The parameter λ is sometimes called the precision of the t-distribution, even though it is not in general equal to the inverse of the variance. The parameter ν is called the degrees of freedom, and its effect is illustrated in Figure 2.15. For the particular case of ν = 1, the t-distribution reduces to the Cauchy distribution, while in the limit ν → ∞ the t-distribution St(x|µ, λ, ν) becomes a Gaussian N(x|µ, λ^{−1}) with mean µ and precision λ.

From (2.158), we see that Student's t-distribution is obtained by adding up an infinite number of Gaussian distributions having the same mean but different precisions. This can be interpreted as an infinite mixture of Gaussians (Gaussian mixtures will be discussed in detail in Section 2.3.9).
The result is a distribution that in general has longer 'tails' than a Gaussian, as was seen in Figure 2.15. This gives the t-distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers. The robustness of the t-distribution is illustrated in Figure 2.16, which compares the maximum likelihood solutions for a Gaussian and a t-distribution. Note that the maximum likelihood solution for the t-distribution can be found using the expectation-maximization (EM) algorithm.
[Figure 2.16: Illustration of the robustness of Student's t-distribution compared to a Gaussian. (a) Histogram distribution of 30 data points drawn from a Gaussian distribution, together with the maximum likelihood fit obtained from a t-distribution (red curve) and a Gaussian (green curve, largely hidden by the red curve). Because the t-distribution contains the Gaussian as a special case it gives almost the same solution as the Gaussian. (b) The same data set but with three additional outlying data points, showing how the Gaussian (green curve) is strongly distorted by the outliers, whereas the t-distribution (red curve) is relatively unaffected.]

Here we see that the effect of a small number of outliers is much less significant for the t-distribution than for the Gaussian.
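A rough reproduction of the Figure 2.16 experiment (the seed and outlier values are invented for illustration), using scipy's built-in maximum likelihood fitting:

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(3)
bulk = rng.normal(0.0, 1.0, size=30)              # 30 Gaussian data points, as in (a)
data = np.concatenate([bulk, [8.0, 9.0, 10.0]])   # add three outliers, as in (b)

mu_g, sigma_g = norm.fit(data)     # Gaussian ML fit: mean is dragged toward the outliers
df_t, mu_t, sigma_t = t.fit(data)  # t ML fit: location stays near the bulk of the data
```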
Outliers can arise in practical applications either because the process that generates the data corresponds to a distribution having a heavy tail or simply through mislabelled data. Robustness is also an important property for regression problems. Unsurprisingly, the least squares approach to regression does not exhibit robustness, because it corresponds to maximum likelihood under a (conditional) Gaussian distribution. By basing a regression model on a heavy-tailed distribution such as a t-distribution, we obtain a more robust model.

If we go back to (2.158) and substitute the alternative parameters ν = 2a, λ = a/b, and η = τb/a, we see that the t-distribution can be written in the form

    St(x|µ, λ, ν) = ∫_0^∞ N(x|µ, (ηλ)^{−1}) Gam(η|ν/2, ν/2) dη.    (2.160)

We can then generalize this to a multivariate Gaussian N(x|µ, Λ) to obtain the corresponding multivariate Student's t-distribution in the form

    St(x|µ, Λ, ν) = ∫_0^∞ N(x|µ, (ηΛ)^{−1}) Gam(η|ν/2, ν/2) dη.    (2.161)

Using the same technique as for the univariate case, we can evaluate this integral to give (Exercise 2.48)
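Returning to the univariate case, the mixture representation (2.160) also gives a simple two-stage sampler, sketched below (function name and seed are illustrative): draw η from Gam(ν/2, ν/2), then draw x from the corresponding Gaussian.

```python
import numpy as np
from scipy.stats import gamma, norm

def sample_st(mu, lam, nu, size, rng):
    """Draw from St(x | mu, lam, nu) via the scale mixture (2.160):
    eta ~ Gam(nu/2, nu/2), then x | eta ~ N(mu, (eta * lam)^-1)."""
    eta = gamma.rvs(nu / 2, scale=2.0 / nu, size=size, random_state=rng)  # rate nu/2
    return norm.rvs(loc=mu, scale=1.0 / np.sqrt(eta * lam),
                    size=size, random_state=rng)

samples = sample_st(0.0, 1.0, 3.0, size=10000, rng=np.random.default_rng(4))
```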