Bishop, C. M., Pattern Recognition and Machine Learning (2006), Section 2.4: The Exponential Family (continued)
In some circumstances, it will be convenient to remove this constraint by expressing the distribution in terms of only M − 1 parameters. This can be achieved by using the relationship (2.209) to eliminate µ_M, expressing it in terms of the remaining {µ_k} where k = 1, . . . , M − 1, thereby leaving M − 1 parameters. Note that these remaining parameters are still subject to the constraints

$$0 \leqslant \mu_k \leqslant 1, \qquad \sum_{k=1}^{M-1} \mu_k \leqslant 1. \tag{2.210}$$

Making use of the constraint (2.209), the multinomial distribution in this representation then becomes

$$
\begin{aligned}
\exp\left\{\sum_{k=1}^{M} x_k \ln \mu_k\right\}
&= \exp\left\{\sum_{k=1}^{M-1} x_k \ln \mu_k + \left(1 - \sum_{k=1}^{M-1} x_k\right)\ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\} \\
&= \exp\left\{\sum_{k=1}^{M-1} x_k \ln\left(\frac{\mu_k}{1 - \sum_j \mu_j}\right) + \ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\}.
\end{aligned}
\tag{2.211}
$$

We now identify

$$\ln\left(\frac{\mu_k}{1 - \sum_j \mu_j}\right) = \eta_k \tag{2.212}$$

which we can solve for µ_k by first summing both sides over k and then rearranging and back-substituting to give

$$\mu_k = \frac{\exp(\eta_k)}{1 + \sum_j \exp(\eta_j)}. \tag{2.213}$$

This is called the softmax function, or the normalized exponential.
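As an illustrative aside (not part of the original text), the softmax map (2.213) and its inverse (2.212) can be checked numerically; the following minimal NumPy sketch uses function names and a three-state example of our own choosing.

```python
import numpy as np

def mu_from_eta(eta):
    """Softmax map (2.213): mu_k = exp(eta_k) / (1 + sum_j exp(eta_j)).
    (Naive exponentiation; very large eta would need the usual max-subtraction trick.)"""
    e = np.exp(eta)
    return e / (1.0 + e.sum())

def eta_from_mu(mu):
    """Inverse map (2.212): eta_k = ln( mu_k / (1 - sum_j mu_j) )."""
    mu = np.asarray(mu, dtype=float)
    return np.log(mu / (1.0 - mu.sum()))

# Round trip on the first M-1 = 2 components of a hypothetical 3-state distribution;
# the remaining probability, mu_3 = 0.3, is fixed by the constraint (2.209).
mu = np.array([0.2, 0.5])
print(np.allclose(mu_from_eta(eta_from_mu(mu)), mu))   # True
```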
In this representation, the multinomial distribution therefore takes the form

$$p(\mathbf{x}|\boldsymbol{\eta}) = \left(1 + \sum_{k=1}^{M-1} \exp(\eta_k)\right)^{-1} \exp\left(\boldsymbol{\eta}^{\mathrm{T}}\mathbf{x}\right). \tag{2.214}$$

This is the standard form of the exponential family, with parameter vector η = (η_1, . . . , η_{M−1})^T in which

$$\mathbf{u}(\mathbf{x}) = \mathbf{x} \tag{2.215}$$
$$h(\mathbf{x}) = 1 \tag{2.216}$$
$$g(\boldsymbol{\eta}) = \left(1 + \sum_{k=1}^{M-1} \exp(\eta_k)\right)^{-1}. \tag{2.217}$$

Finally, let us consider the Gaussian distribution. For the univariate Gaussian, we have

$$
\begin{aligned}
p(x|\mu, \sigma^2) &= \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} \\
&= \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right\}
\end{aligned}
\tag{2.218, 2.219}
$$

which, after some simple rearrangement, can be cast in the standard exponential family form (2.194) (Exercise 2.57) with

$$\boldsymbol{\eta} = \begin{pmatrix} \mu/\sigma^2 \\ -1/2\sigma^2 \end{pmatrix} \tag{2.220}$$
$$\mathbf{u}(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix} \tag{2.221}$$
$$h(x) = (2\pi)^{-1/2} \tag{2.222}$$
$$g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right). \tag{2.223}$$
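To make the identifications (2.220)-(2.223) concrete, here is a small numerical sketch (ours, with NumPy assumed and arbitrary test values) that evaluates the Gaussian both directly via (2.218) and through the decomposition h(x) g(η) exp{η^T u(x)}, confirming that the two agree.

```python
import numpy as np

def gaussian_natural_params(mu, sigma2):
    """Natural parameters (2.220): eta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def gaussian_exp_family_pdf(x, eta):
    """Evaluate h(x) g(eta) exp(eta^T u(x)) with u(x) = (x, x^2), per (2.221)-(2.223)."""
    eta1, eta2 = eta
    h = (2.0 * np.pi) ** -0.5                                    # (2.222)
    g = np.sqrt(-2.0 * eta2) * np.exp(eta1 ** 2 / (4.0 * eta2))  # (2.223)
    return h * g * np.exp(eta1 * x + eta2 * x ** 2)

mu, sigma2, x = 1.5, 0.8, 0.3
direct = (2.0 * np.pi * sigma2) ** -0.5 * np.exp(-((x - mu) ** 2) / (2.0 * sigma2))  # (2.218)
print(np.isclose(gaussian_exp_family_pdf(x, gaussian_natural_params(mu, sigma2)), direct))  # True
```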
2.4.1 Maximum likelihood and sufficient statistics

Let us now consider the problem of estimating the parameter vector η in the general exponential family distribution (2.194) using the technique of maximum likelihood. Taking the gradient of both sides of (2.195) with respect to η, we have

$$\nabla g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathrm{d}\mathbf{x} + g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,\mathrm{d}\mathbf{x} = 0. \tag{2.224}$$

Rearranging, and making use again of (2.195), then gives

$$-\frac{1}{g(\boldsymbol{\eta})}\nabla g(\boldsymbol{\eta}) = g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,\mathrm{d}\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})] \tag{2.225}$$

where we have used (2.194). We therefore obtain the result

$$-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]. \tag{2.226}$$

Note that the covariance of u(x) can be expressed in terms of the second derivatives of g(η), and similarly for higher order moments (Exercise 2.58). Thus, provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation.
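As a simple check of (2.226), consider the Bernoulli distribution, for which (as shown earlier in Section 2.4) g(η) = σ(−η) and u(x) = x, so that −d/dη ln g(η) should equal the mean µ = σ(η). The sketch below (ours; NumPy assumed) verifies this with a finite difference at an arbitrary η.

```python
import numpy as np

def ln_g_bernoulli(eta):
    """ln g(eta) for the Bernoulli, where g(eta) = sigma(-eta) = 1 / (1 + exp(eta))."""
    return -np.log1p(np.exp(eta))

eta, eps = 0.7, 1e-6
# Central finite difference for -d/d_eta ln g(eta) ...
lhs = -(ln_g_bernoulli(eta + eps) - ln_g_bernoulli(eta - eps)) / (2.0 * eps)
# ... which by (2.226) should equal E[u(x)] = E[x] = mu = sigma(eta).
rhs = 1.0 / (1.0 + np.exp(-eta))
print(np.isclose(lhs, rhs))   # True
```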
Now consider a set of independent identically distributed data denoted by X = {x_1, . . . , x_N}, for which the likelihood function is given by

$$p(\mathbf{X}|\boldsymbol{\eta}) = \left(\prod_{n=1}^{N} h(\mathbf{x}_n)\right) g(\boldsymbol{\eta})^{N} \exp\left\{\boldsymbol{\eta}^{\mathrm{T}} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)\right\}. \tag{2.227}$$

Setting the gradient of ln p(X|η) with respect to η to zero, we get the following condition to be satisfied by the maximum likelihood estimator η_ML

$$-\nabla \ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) \tag{2.228}$$

which can in principle be solved to obtain η_ML. We see that the solution for the maximum likelihood estimator depends on the data only through Σ_n u(x_n), which is therefore called the sufficient statistic of the distribution (2.194). We do not need to store the entire data set itself but only the value of the sufficient statistic. For the Bernoulli distribution, for example, the function u(x) is given just by x and so we need only keep the sum of the data points {x_n}, whereas for the Gaussian u(x) = (x, x²)^T, and so we should keep both the sum of {x_n} and the sum of {x_n²}.
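The following sketch (ours; NumPy assumed, with synthetic data) makes this concrete for the univariate Gaussian: only the running sums of x_n and x_n² are accumulated, and solving (2.228) then yields the familiar estimators for the mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic data stream

# Accumulate only the sufficient statistics; the raw data need not be stored.
N, sum_x, sum_x2 = 0, 0.0, 0.0
for x in stream:
    N += 1
    sum_x += x
    sum_x2 += x ** 2

# Solving (2.228) for the Gaussian: equating (1/N) sum_n u(x_n) with
# E[u(x)] = (mu, mu^2 + sigma^2) gives the usual ML estimators.
mu_ml = sum_x / N
sigma2_ml = sum_x2 / N - mu_ml ** 2
print(mu_ml, sigma2_ml)   # close to 2.0 and 1.5**2 = 2.25
```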
If we consider the limit N → ∞, then the right-hand side of (2.228) becomes E[u(x)], and so by comparing with (2.226) we see that in this limit η_ML will equal the true value η.

In fact, this sufficiency property holds also for Bayesian inference, although we shall defer discussion of this until Chapter 8 when we have equipped ourselves with the tools of graphical models and can thereby gain a deeper insight into these important concepts.

2.4.2 Conjugate priors

We have already encountered the concept of a conjugate prior several times, for example in the context of the Bernoulli distribution (for which the conjugate prior is the beta distribution) or the Gaussian (where the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the Wishart distribution).
In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior. For any member of the exponential family (2.194), there exists a conjugate prior that can be written in the form

$$p(\boldsymbol{\eta}|\boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\left\{\nu \boldsymbol{\eta}^{\mathrm{T}} \boldsymbol{\chi}\right\} \tag{2.229}$$

where f(χ, ν) is a normalization coefficient, and g(η) is the same function as appears in (2.194).
To see that this is indeed conjugate, let us multiply the prior (2.229) by the likelihood function (2.227) to obtain the posterior distribution, up to a normalization coefficient, in the form

$$p(\boldsymbol{\eta}|\mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\left(\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) + \nu \boldsymbol{\chi}\right)\right\}. \tag{2.230}$$

This again takes the same functional form as the prior (2.229), confirming conjugacy. Furthermore, we see that the parameter ν can be interpreted as an effective number of pseudo-observations in the prior, each of which has a value for the sufficient statistic u(x) given by χ.
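As a sketch of how (2.230) translates into an update rule (ours; NumPy assumed, with hypothetical Bernoulli data for which u(x) = x), the posterior is obtained simply by incrementing ν by N and shifting νχ by the summed sufficient statistics; ν grows by the number of observations and χ moves from its prior value toward the empirical mean of u(x), matching the pseudo-observation interpretation above.

```python
import numpy as np

def conjugate_update(nu, chi, u_values):
    """Generic exponential-family conjugate update, following (2.230):
    nu -> nu + N  and  nu * chi -> nu * chi + sum_n u(x_n)."""
    u_values = np.atleast_2d(u_values)        # shape (N, dimension of u)
    N = u_values.shape[0]
    nu_post = nu + N
    chi_post = (nu * np.asarray(chi, dtype=float) + u_values.sum(axis=0)) / nu_post
    return nu_post, chi_post

# Hypothetical coin-flip data: u(x) = x, chi is the prior mean of x,
# and nu counts the pseudo-observations carried by the prior.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=50).reshape(-1, 1)
nu_post, chi_post = conjugate_update(nu=2.0, chi=[0.5], u_values=x)
print(nu_post, chi_post)   # 52 pseudo-observations; updated statistic close to 0.7
```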
2.4.3 Noninformative priors

In some applications of probabilistic inference, we may have prior knowledge that can be conveniently expressed through the prior distribution. For example, if the prior assigns zero probability to some value of a variable, then the posterior distribution will necessarily also assign zero probability to that value, irrespective of any subsequent observations of data. In many cases, however, we may have little idea of what form the distribution should take. We may then seek a form of prior distribution, called a noninformative prior, which is intended to have as little influence on the posterior distribution as possible (Jeffreys, 1946; Box and Tiao, 1973; Bernardo and Smith, 1994).
This is sometimes referred to as ‘letting the data speak for themselves’.

If we have a distribution p(x|λ) governed by a parameter λ, we might be tempted to propose a prior distribution p(λ) = const as a suitable prior. If λ is a discrete variable with K states, this simply amounts to setting the prior probability of each state to 1/K. In the case of continuous parameters, however, there are two potential difficulties with this approach. The first is that, if the domain of λ is unbounded, this prior distribution cannot be correctly normalized because the integral over λ diverges. Such priors are called improper. In practice, improper priors can often be used provided the corresponding posterior distribution is proper, i.e., that it can be correctly normalized.
For instance, if we put a uniform prior distribution over the mean of a Gaussian, then the posterior distribution for the mean, once we have observed at least one data point, will be proper.

A second difficulty arises from the transformation behaviour of a probability density under a nonlinear change of variables, given by (1.27).
If a function h(λ) is constant, and we change variables to λ = η², then ĥ(η) = h(η²) will also be constant. However, if we choose the density p_λ(λ) to be constant, then the density of η will be given, from (1.27), by

$$p_{\eta}(\eta) = p_{\lambda}(\lambda)\left|\frac{\mathrm{d}\lambda}{\mathrm{d}\eta}\right| = p_{\lambda}(\eta^2)\, 2\eta \propto \eta \tag{2.231}$$

and so the density over η will not be constant. This issue does not arise when we use maximum likelihood, because the likelihood function p(x|λ) is a simple function of λ and so we are free to use any convenient parameterization. If, however, we are to choose a prior distribution that is constant, we must take care to use an appropriate representation for the parameters.
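A quick Monte Carlo sanity check of (2.231) (ours, with arbitrary sample sizes): drawing λ from a flat density on (0, 1) and transforming to η = λ^{1/2} gives an empirical density over η that grows linearly in η rather than staying constant.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, size=200_000)   # flat density over lambda on (0, 1)
eta = np.sqrt(lam)                          # change of variables lambda = eta^2

# Empirical density of eta on five equal bins: grows linearly, matching (2.231).
hist, _ = np.histogram(eta, bins=5, range=(0.0, 1.0), density=True)
print(hist.round(2))   # roughly [0.2, 0.6, 1.0, 1.4, 1.8], i.e. proportional to eta
```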
Here we consider two simple examples of noninformative priors (Berger, 1985). First of all, if a density takes the form

$$p(x|\mu) = f(x - \mu) \tag{2.232}$$

then the parameter µ is known as a location parameter. This family of densities exhibits translation invariance, because if we shift x by a constant to give x̂ = x + c, then

$$p(\widehat{x}|\widehat{\mu}) = f(\widehat{x} - \widehat{\mu}) \tag{2.233}$$

where we have defined µ̂ = µ + c. Thus the density takes the same form in the new variable as in the original one, and so the density is independent of the choice of origin. We would like to choose a prior distribution that reflects this translation invariance property, and so we choose a prior that assigns equal probability mass to an interval A ⩽ µ ⩽ B as to the shifted interval A − c ⩽ µ ⩽ B − c. This implies

$$\int_{A}^{B} p(\mu)\,\mathrm{d}\mu = \int_{A-c}^{B-c} p(\mu)\,\mathrm{d}\mu = \int_{A}^{B} p(\mu - c)\,\mathrm{d}\mu \tag{2.234}$$

and because this must hold for all choices of A and B, we have

$$p(\mu - c) = p(\mu) \tag{2.235}$$

which implies that p(µ) is constant. An example of a location parameter would be the mean µ of a Gaussian distribution. As we have seen, the conjugate prior distribution for µ in this case is a Gaussian p(µ|µ₀, σ₀²) = N(µ|µ₀, σ₀²), and we obtain a noninformative prior by taking the limit σ₀² → ∞. Indeed, from (2.141) and (2.142) we see that this gives a posterior distribution over µ in which the contributions from the prior vanish.
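The vanishing influence of the prior can be seen numerically using the standard conjugate update for the mean of a Gaussian with known variance (the update the text cites as (2.141) and (2.142)); the sketch below (ours, with synthetic data and a deliberately poor prior mean) shows the posterior mean approaching the sample mean and the posterior variance approaching σ²/N as σ₀² → ∞.

```python
import numpy as np

def gaussian_mean_posterior(x, sigma2, mu0, sigma02):
    """Posterior over the Gaussian mean with known variance sigma2 and prior
    N(mu | mu0, sigma02): the standard conjugate update (cf. (2.141), (2.142))."""
    N = x.size
    sigma2_N = 1.0 / (1.0 / sigma02 + N / sigma2)
    mu_N = sigma2_N * (mu0 / sigma02 + x.sum() / sigma2)
    return mu_N, sigma2_N

rng = np.random.default_rng(3)
x = rng.normal(loc=4.0, scale=1.0, size=20)

# As sigma0^2 grows, the prior's contribution vanishes: the posterior mean
# tends to the sample mean and the posterior variance to sigma^2 / N.
for sigma02 in (1.0, 1e3, 1e9):
    print(sigma02, gaussian_mean_posterior(x, sigma2=1.0, mu0=-10.0, sigma02=sigma02))
print(x.mean(), 1.0 / x.size)   # the limiting values
```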
As a second example, consider a density of the form

$$p(x|\sigma) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right) \tag{2.236}$$

where σ > 0. Note that this will be a normalized density provided f(x) is correctly normalized (Exercise 2.59). The parameter σ is known as a scale parameter, and the density exhibits scale invariance, because if we scale x by a constant to give x̂ = cx, then

$$p(\widehat{x}|\widehat{\sigma}) = \frac{1}{\widehat{\sigma}} f\left(\frac{\widehat{x}}{\widehat{\sigma}}\right) \tag{2.237}$$

where we have defined σ̂ = cσ. This transformation corresponds to a change of scale, for example from meters to kilometers if x is a length, and we would like to choose a prior distribution that reflects this scale invariance.
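As a final numerical aside (ours; f is taken to be a standard Gaussian purely for illustration), rescaling samples drawn under σ is statistically equivalent to sampling afresh under σ̂ = cσ, which is the scale invariance expressed by (2.237).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, c = 2.0, 3.0
qs = [0.1, 0.25, 0.5, 0.75, 0.9]

# Samples from p(x|sigma) = (1/sigma) f(x/sigma) with f a standard Gaussian.
x = sigma * rng.standard_normal(200_000)

# Rescaling the data by c, versus sampling directly with sigma_hat = c * sigma:
print(np.round(np.quantile(c * x, qs), 2))
print(np.round(np.quantile((c * sigma) * rng.standard_normal(200_000), qs), 2))
# The two rows of quantiles agree up to Monte Carlo noise.
```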