Using (C.19), the derivative of the log likelihood with respect to µ is given by

\[
\frac{\partial}{\partial \mu} \ln p(X|\mu, \Sigma) = \sum_{n=1}^{N} \Sigma^{-1}(x_n - \mu)
\tag{2.120}
\]

and setting this derivative to zero, we obtain the solution for the maximum likelihood estimate of the mean given by

\[
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n
\tag{2.121}
\]

which is the mean of the observed set of data points. The maximization of (2.118) with respect to Σ is rather more involved (Exercise 2.34).
The simplest approach is to ignore the symmetry constraint and show that the resulting solution is symmetric as required. Alternative derivations of this result, which impose the symmetry and positive definiteness constraints explicitly, can be found in Magnus and Neudecker (1999). The result is as expected and takes the form

\[
\Sigma_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathrm{T}}
\tag{2.122}
\]

which involves µ_ML because this is the result of a joint maximization with respect to µ and Σ. Note that the solution (2.121) for µ_ML does not depend on Σ_ML, and so we can first evaluate µ_ML and then use this to evaluate Σ_ML.

If we evaluate the expectations of the maximum likelihood solutions under the true distribution, we obtain the following results (Exercise 2.35)

\[
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu
\tag{2.123}
\]
\[
\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\,\Sigma.
\tag{2.124}
\]

We see that the expectation of the maximum likelihood estimate for the mean is equal to the true mean. However, the maximum likelihood estimate for the covariance has an expectation that is less than the true value, and hence it is biased.
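As a quick numerical check of (2.121)-(2.124), the following minimal NumPy sketch (the data settings and variable names are my own, not from the text) draws many small data sets from a known Gaussian, computes the maximum likelihood estimates, and averages them to expose the 1/N bias in Σ_ML.

import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
N, trials = 5, 20000

avg_Sigma_ML = np.zeros((2, 2))
for _ in range(trials):
    X = rng.multivariate_normal(mu_true, Sigma_true, size=N)
    mu_ML = X.mean(axis=0)          # sample mean, equation (2.121)
    D = X - mu_ML
    Sigma_ML = D.T @ D / N          # biased ML covariance, equation (2.122)
    avg_Sigma_ML += Sigma_ML / trials

print(avg_Sigma_ML)                 # close to ((N-1)/N) * Sigma_true, as in (2.124)
print((N - 1) / N * Sigma_true)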
We can correct this bias by defining a different estimator Σ̃ given by

\[
\widetilde{\Sigma} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathrm{T}}.
\tag{2.125}
\]

Comparing (2.125) with (2.122) and using (2.124), we see that the expectation of Σ̃ is equal to Σ.

2.3.5 Sequential estimation

Our discussion of the maximum likelihood solution for the parameters of a Gaussian distribution provides a convenient opportunity to give a more general discussion of the topic of sequential estimation for maximum likelihood. Sequential methods allow data points to be processed one at a time and then discarded. They are important for on-line applications, and also where large data sets are involved so that batch processing of all data points at once is infeasible.

Consider the result (2.121) for the maximum likelihood estimator of the mean, which we will denote by µ_ML^(N) when it is based on N observations. If we dissect out the contribution from the final data point x_N, we obtain

\[
\mu_{\mathrm{ML}}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} x_n
= \frac{1}{N} x_N + \frac{1}{N} \sum_{n=1}^{N-1} x_n
= \frac{1}{N} x_N + \frac{N-1}{N}\, \mu_{\mathrm{ML}}^{(N-1)}
= \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\bigl(x_N - \mu_{\mathrm{ML}}^{(N-1)}\bigr).
\tag{2.126}
\]

This result has a nice interpretation, as follows.
After observing N − 1 data points we have estimated µ by µ_ML^(N−1). We now observe data point x_N, and we obtain our revised estimate µ_ML^(N) by moving the old estimate a small amount, proportional to 1/N, in the direction of the 'error signal' (x_N − µ_ML^(N−1)). Note that, as N increases, so the contribution from successive data points gets smaller.
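As a concrete illustration, here is a minimal NumPy sketch (the data settings are arbitrary choices of mine) that applies the update (2.126) to a stream of points one at a time and compares the result with the batch sample mean.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.8, scale=0.3, size=1000)   # a stream of observations

mu = 0.0                         # estimate before any data are seen
for n, x_n in enumerate(x, start=1):
    mu += (x_n - mu) / n         # sequential update, equation (2.126)

print(mu, x.mean())              # the two agree up to floating-point round-off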
The result (2.126) will clearly give the same answer as the batch result (2.121) because the two formulae are equivalent. However, we will not always be able to derive a sequential algorithm by this route, and so we seek a more general formulation of sequential learning, which leads us to the Robbins-Monro algorithm. Consider a pair of random variables θ and z governed by a joint distribution p(z, θ). The conditional expectation of z given θ defines a deterministic function f(θ) that is given by

\[
f(\theta) \equiv \mathbb{E}[z\,|\,\theta] = \int z\, p(z|\theta)\,\mathrm{d}z
\tag{2.127}
\]

and is illustrated schematically in Figure 2.10.

[Figure 2.10: A schematic illustration of two correlated random variables z and θ, together with the regression function f(θ) given by the conditional expectation E[z|θ]. The Robbins-Monro algorithm provides a general sequential procedure for finding the root θ⋆ of such functions.]
Functions defined in this way are called regression functions.

Our goal is to find the root θ⋆ at which f(θ⋆) = 0. If we had a large data set of observations of z and θ, then we could model the regression function directly and obtain an estimate of its root. Suppose, however, that we observe values of z one at a time and we wish to find a corresponding sequential estimation scheme for θ⋆. The following general procedure for solving such problems was given by Robbins and Monro (1951).
We shall assume that the conditional variance of z is finite, so that

\[
\mathbb{E}\bigl[(z - f)^2 \,\big|\, \theta\bigr] < \infty
\tag{2.128}
\]

and we shall also, without loss of generality, consider the case where f(θ) > 0 for θ < θ⋆ and f(θ) < 0 for θ > θ⋆, so that the expected correction below pushes θ towards the root. The Robbins-Monro procedure then defines a sequence of successive estimates of the root θ⋆ given by

\[
\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, z\bigl(\theta^{(N-1)}\bigr)
\tag{2.129}
\]

where z(θ^(N)) is an observed value of z when θ takes the value θ^(N).
The coefficients {a_N} represent a sequence of positive numbers that satisfy the conditions

\[
\lim_{N \to \infty} a_N = 0
\tag{2.130}
\]
\[
\sum_{N=1}^{\infty} a_N = \infty
\tag{2.131}
\]
\[
\sum_{N=1}^{\infty} a_N^2 < \infty.
\tag{2.132}
\]

It can then be shown (Robbins and Monro, 1951; Fukunaga, 1990) that the sequence of estimates given by (2.129) does indeed converge to the root with probability one. Note that the first condition (2.130) ensures that the successive corrections decrease in magnitude so that the process can converge to a limiting value. The second condition (2.131) is required to ensure that the algorithm does not converge short of the root, and the third condition (2.132) is needed to ensure that the accumulated noise has finite variance and hence does not spoil convergence.
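A minimal sketch of the Robbins-Monro iteration (2.129), assuming a toy regression function f(θ) = θ⋆ − θ observed in Gaussian noise (so that f is positive below the root and negative above it) and step sizes a_N = 1/N, which satisfy (2.130)-(2.132); none of these specific choices comes from the text.

import numpy as np

rng = np.random.default_rng(2)
theta_star = 2.0                     # the root, unknown to the algorithm

def z_obs(theta):
    # noisy observation whose conditional mean is f(theta) = theta_star - theta
    return (theta_star - theta) + rng.normal(scale=1.0)

theta = 0.0
for n in range(1, 10001):
    a = 1.0 / n                      # positive, decreasing, divergent sum, convergent sum of squares
    theta += a * z_obs(theta)        # Robbins-Monro update, equation (2.129)

print(theta)                         # close to theta_star = 2.0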
Now let us consider how a general maximum likelihood problem can be solved sequentially using the Robbins-Monro algorithm. By definition, the maximum likelihood solution θ_ML is a stationary point of the log likelihood function and hence satisfies

\[
\frac{\partial}{\partial \theta} \left\{ \frac{1}{N} \sum_{n=1}^{N} \ln p(x_n|\theta) \right\} \Bigg|_{\theta_{\mathrm{ML}}} = 0.
\tag{2.133}
\]

Exchanging the derivative and the summation, and taking the limit N → ∞, we have

\[
\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \theta} \ln p(x_n|\theta)
= \mathbb{E}_x\!\left[ \frac{\partial}{\partial \theta} \ln p(x|\theta) \right]
\tag{2.134}
\]

and so we see that finding the maximum likelihood solution corresponds to finding the root of a regression function. We can therefore apply the Robbins-Monro procedure, which now takes the form

\[
\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, \frac{\partial}{\partial \theta^{(N-1)}} \ln p\bigl(x_N|\theta^{(N-1)}\bigr).
\tag{2.135}
\]
As a specific example, we consider once again the sequential estimation of the mean of a Gaussian distribution, in which case the parameter θ^(N) is the estimate µ_ML^(N) of the mean of the Gaussian, and the random variable z is given by

\[
z = \frac{\partial}{\partial \mu_{\mathrm{ML}}} \ln p\bigl(x|\mu_{\mathrm{ML}}, \sigma^2\bigr) = \frac{1}{\sigma^2}\bigl(x - \mu_{\mathrm{ML}}\bigr).
\tag{2.136}
\]

Thus the distribution of z is Gaussian with mean (µ − µ_ML)/σ², as illustrated in Figure 2.11. Substituting (2.136) into (2.135), we obtain the univariate form of (2.126), provided we choose the coefficients a_N to have the form a_N = σ²/N.

[Figure 2.11: In the case of a Gaussian distribution, with θ corresponding to the mean µ, the regression function illustrated in Figure 2.10 takes the form of a straight line, shown in red. Here the random variable z corresponds to the derivative of the log likelihood function and is given by (x − µ_ML)/σ², and its expectation, which defines the regression function, is the straight line (µ − µ_ML)/σ². The root of the regression function corresponds to the maximum likelihood estimator µ_ML.]
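The following sketch (NumPy; the variance, true mean, and initial value are assumptions of mine) runs the update (2.135) with z given by (2.136), using the coefficient σ²/N when processing the N-th point, and confirms that it reproduces the running sample mean of (2.126).

import numpy as np

rng = np.random.default_rng(3)
sigma2 = 0.25                                # known variance of the Gaussian
x = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=500)

mu = 0.0                                     # current estimate of the mean
for N, x_N in enumerate(x, start=1):
    z = (x_N - mu) / sigma2                  # score for the new point, equation (2.136)
    mu += (sigma2 / N) * z                   # update (2.135) with coefficient sigma^2 / N
    # the sigma^2 factors cancel, so this is exactly the update (2.126)

print(mu, x.mean())                          # identical up to round-off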
Note that although we have focussed on the case of a single variable, the same technique, together with the same restrictions (2.130)–(2.132) on the coefficients a_N, applies equally to the multivariate case (Blum, 1965).

2.3.6 Bayesian inference for the Gaussian

The maximum likelihood framework gave point estimates for the parameters µ and Σ. Now we develop a Bayesian treatment by introducing prior distributions over these parameters.
Let us begin with a simple example in which we consider a single Gaussian random variable x. We shall suppose that the variance σ² is known, and we consider the task of inferring the mean µ given a set of N observations X = {x_1, . . . , x_N}. The likelihood function, that is, the probability of the observed data given µ, viewed as a function of µ, is given by

\[
p(X|\mu) = \prod_{n=1}^{N} p(x_n|\mu)
= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}.
\tag{2.137}
\]

Again we emphasize that the likelihood function p(X|µ) is not a probability distribution over µ and is not normalized.

We see that the likelihood function takes the form of the exponential of a quadratic form in µ.
Thus if we choose a prior p(µ) given by a Gaussian, it will be a conjugate distribution for this likelihood function because the corresponding posterior will be a product of two exponentials of quadratic functions of µ and hence will also be Gaussian. We therefore take our prior distribution to be

\[
p(\mu) = \mathcal{N}\bigl(\mu|\mu_0, \sigma_0^2\bigr)
\tag{2.138}
\]

and the posterior distribution is given by

\[
p(\mu|X) \propto p(X|\mu)\, p(\mu).
\tag{2.139}
\]

Simple manipulation involving completing the square in the exponent shows that the posterior distribution is given by (Exercise 2.38)

\[
p(\mu|X) = \mathcal{N}\bigl(\mu|\mu_N, \sigma_N^2\bigr)
\tag{2.140}
\]

where

\[
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}}
\tag{2.141}
\]
\[
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
\tag{2.142}
\]

in which µ_ML is the maximum likelihood solution for µ given by the sample mean

\[
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n.
\tag{2.143}
\]

It is worth spending a moment studying the form of the posterior mean and variance.
First of all, we note that the mean of the posterior distribution given by (2.141) is a compromise between the prior mean µ_0 and the maximum likelihood solution µ_ML. If the number of observed data points N = 0, then (2.141) reduces to the prior mean as expected. For N → ∞, the posterior mean is given by the maximum likelihood solution. Similarly, consider the result (2.142) for the variance of the posterior distribution. We see that this is most naturally expressed in terms of the inverse variance, which is called the precision. Furthermore, the precisions are additive, so that the precision of the posterior is given by the precision of the prior plus one contribution of the data precision from each of the observed data points. As we increase the number of observed data points, the precision steadily increases, corresponding to a posterior distribution with steadily decreasing variance. With no observed data points, we have the prior variance, whereas if the number of data points N → ∞, the variance σ_N² goes to zero and the posterior distribution becomes infinitely peaked around the maximum likelihood solution. We therefore see that the maximum likelihood result of a point estimate for µ given by (2.143) is recovered precisely from the Bayesian formalism in the limit of an infinite number of observations.
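A small numerical illustration of (2.140)-(2.142) in NumPy; the prior, noise variance, and sample size below are my own choices, loosely resembling the setting described in the caption of Figure 2.12 below.

import numpy as np

rng = np.random.default_rng(4)
sigma2 = 0.1                     # known data variance
mu0, sigma02 = 0.0, 0.1          # Gaussian prior over the mean, equation (2.138)
x = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=10)

N = len(x)
mu_ML = x.mean()                                                       # equation (2.143)
mu_N = (sigma2 * mu0 + N * sigma02 * mu_ML) / (N * sigma02 + sigma2)   # equation (2.141)
sigma2_N = 1.0 / (1.0 / sigma02 + N / sigma2)                          # equation (2.142)

print(mu_N, sigma2_N)            # posterior mean lies between mu0 and mu_ML; variance below sigma02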
Note also that for finite N, if we take the limit σ_0² → ∞, in which the prior has infinite variance, then the posterior mean (2.141) reduces to the maximum likelihood result, while from (2.142) the posterior variance is given by σ_N² = σ²/N.

[Figure 2.12: Illustration of Bayesian inference for the mean µ of a Gaussian distribution, in which the variance is assumed to be known. The curves show the prior distribution over µ (the curve labelled N = 0), which in this case is itself Gaussian, along with the posterior distribution given by (2.140) for increasing numbers N of data points (N = 1, 2 and 10). The data points are generated from a Gaussian of mean 0.8 and variance 0.1, and the prior is chosen to have mean 0. In both the prior and the likelihood function, the variance is set to the true value.]

We illustrate our analysis of Bayesian inference for the mean of a Gaussian distribution in Figure 2.12. The generalization of this result to the case of a D-dimensional Gaussian random variable x with known covariance and unknown mean is straightforward (Exercise 2.40).

We have already seen (Section 2.3.5) how the maximum likelihood expression for the mean of a Gaussian can be re-cast as a sequential update formula in which the mean after observing N data points was expressed in terms of the mean after observing N − 1 data points together with the contribution from data point x_N.
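Since the Gaussian prior is conjugate, the posterior (2.140)-(2.142) can likewise be accumulated one observation at a time, with the posterior after N − 1 points acting as the prior for x_N. The sketch below (same kind of assumed settings as the previous example) checks that this sequential view reproduces the batch formula (2.141).

import numpy as np

rng = np.random.default_rng(5)
sigma2 = 0.1                         # known data variance
mu0, sigma02 = 0.0, 0.1              # initial prior over the mean
x = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=10)

mu_n, sigma2_n = mu0, sigma02        # running posterior, initialised at the prior
for x_n in x:
    # (2.141)-(2.142) applied to a single observation, with the current
    # posterior playing the role of the prior for the next data point
    mu_n = (sigma2 * mu_n + sigma2_n * x_n) / (sigma2_n + sigma2)
    sigma2_n = 1.0 / (1.0 / sigma2_n + 1.0 / sigma2)

N = len(x)
mu_batch = (sigma2 * mu0 + N * sigma02 * x.mean()) / (N * sigma02 + sigma2)
print(mu_n, mu_batch)                # agree up to floating-point round-off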