Bishop, C. M., Pattern Recognition and Machine Learning (2006)
Indeed, it is simply another beta distribution, and its normalization coefficient can therefore be obtained by comparison with (2.13) to give

p(µ|m, l, a, b) = (Γ(m + a + l + b) / (Γ(m + a) Γ(l + b))) µ^{m+a−1} (1 − µ)^{l+b−1}.    (2.18)

We see that the effect of observing a data set of m observations of x = 1 and l observations of x = 0 has been to increase the value of a by m, and the value of b by l, in going from the prior distribution to the posterior distribution. This allows us to provide a simple interpretation of the hyperparameters a and b in the prior as an effective number of observations of x = 1 and x = 0, respectively. Note that a and b need not be integers. Furthermore, the posterior distribution can act as the prior if we subsequently observe additional data.
Figure 2.3 Illustration of one step of sequential Bayesian inference. The prior is given by a beta distribution with parameters a = 2, b = 2, and the likelihood function, given by (2.9) with N = m = 1, corresponds to a single observation of x = 1, so that the posterior is given by a beta distribution with parameters a = 3, b = 2.

To see this, we can imagine taking observations one at a time and, after each observation, updating the current posterior distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. At each stage, the posterior is a beta distribution with some total number of (prior and actual) observed values for x = 1 and x = 0 given by the parameters a and b.
Incorporation of an additional observation of x = 1 simply corresponds to incrementing the value of a by 1, whereas for an observation of x = 0 we increment b by 1. Figure 2.3 illustrates one step in this process.
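To make the update concrete, here is a minimal Python sketch (not from the book; NumPy is assumed, and the helper name update_beta is illustrative). It maintains the pair (a, b) and applies the increment rule one observation at a time, reproducing the a = 2, b = 2 → a = 3, b = 2 step of Figure 2.3.

```python
import numpy as np

def update_beta(a, b, x):
    # One sequential Bayesian step for a Bernoulli observation with a Beta(a, b)
    # posterior: observing x = 1 increments a, observing x = 0 increments b.
    return (a + 1, b) if x == 1 else (a, b + 1)

# Prior of Figure 2.3: a = 2, b = 2; a single observation of x = 1 gives Beta(3, 2).
a, b = 2, 2
a, b = update_beta(a, b, 1)
print(a, b)  # 3 2

# Processing a simulated stream one point at a time gives the same posterior as a
# single batch update with the total counts m (ones) and l (zeros).
rng = np.random.default_rng(0)
for x in rng.binomial(1, 0.7, size=100):
    a, b = update_beta(a, b, x)
print(a, b)  # equals (3 + number of ones, 2 + number of zeros) in the stream
```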
We see that this sequential approach to learning arises naturally when we adopt a Bayesian viewpoint. It is independent of the choice of prior and of the likelihood function and depends only on the assumption of i.i.d. data. Sequential methods make use of observations one at a time, or in small batches, and then discard them before the next observations are used (see Section 2.3.5). They can be used, for example, in real-time learning scenarios where a steady stream of data is arriving, and predictions must be made before all of the data is seen. Because they do not require the whole data set to be stored or loaded into memory, sequential methods are also useful for large data sets. Maximum likelihood methods can also be cast into a sequential framework.

If our goal is to predict, as best we can, the outcome of the next trial, then we must evaluate the predictive distribution of x, given the observed data set D.
From the sum and product rules of probability, this takes the form

p(x = 1|D) = ∫_0^1 p(x = 1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D].    (2.19)

Using the result (2.18) for the posterior distribution p(µ|D), together with the result (2.15) for the mean of the beta distribution, we obtain

p(x = 1|D) = (m + a) / (m + a + l + b)    (2.20)

which has a simple interpretation as the total fraction of observations (both real observations and fictitious prior observations) that correspond to x = 1. Note that in the limit of an infinitely large data set, m, l → ∞, the result (2.20) reduces to the maximum likelihood result (2.8).
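The behaviour of (2.20) is easy to see numerically. The short Python sketch below is illustrative only (the function names are not from the book): with little data the prior pseudo-counts a and b matter, while for large m and l the predictive probability approaches the maximum likelihood fraction m/(m + l).

```python
def predictive_prob(m, l, a=2.0, b=2.0):
    # Bayesian predictive probability p(x = 1 | D) from (2.20), with m observations
    # of x = 1, l observations of x = 0, and a Beta(a, b) prior.
    return (m + a) / (m + a + l + b)

def ml_prob(m, l):
    # Maximum likelihood estimate m / N from (2.8).
    return m / (m + l)

print(predictive_prob(2, 1), ml_prob(2, 1))              # 0.571...  0.666...
print(predictive_prob(2000, 1000), ml_prob(2000, 1000))  # 0.666...  0.666...
```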
As we shall see, it is a very general property that the Bayesian and maximum likelihood results will agree in the limit of an infinitely large data set. For a finite data set, the posterior mean for µ always lies between the prior mean and the maximum likelihood estimate for µ corresponding to the relative frequencies of events given by (2.7) (Exercise 2.7).

From Figure 2.2, we see that as the number of observations increases, so the posterior distribution becomes more sharply peaked.
This can also be seen from the result (2.16) for the variance of the beta distribution, in which we see that the variance goes to zero for a → ∞ or b → ∞. In fact, we might wonder whether it is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution will steadily decrease.

To address this, we can take a frequentist view of Bayesian learning and show that, on average, such a property does indeed hold.
Consider a general Bayesian inference problem for a parameter θ for which we have observed a data set D, described by the joint distribution p(θ, D). The following result (Exercise 2.8)

E_θ[θ] = E_D[E_θ[θ|D]]    (2.21)

where

E_θ[θ] ≡ ∫ p(θ) θ dθ    (2.22)

E_D[E_θ[θ|D]] ≡ ∫ { ∫ θ p(θ|D) dθ } p(D) dD    (2.23)

says that the posterior mean of θ, averaged over the distribution generating the data, is equal to the prior mean of θ. Similarly, we can show that

var_θ[θ] = E_D[var_θ[θ|D]] + var_D[E_θ[θ|D]].    (2.24)

The term on the left-hand side of (2.24) is the prior variance of θ. On the right-hand side, the first term is the average posterior variance of θ, and the second term measures the variance in the posterior mean of θ. Because this variance is a positive quantity, this result shows that, on average, the posterior variance of θ is smaller than the prior variance.
The reduction in variance is greater if the variance in the posterior mean is greater. Note, however, that this result only holds on average, and that for a particular observed data set it is possible for the posterior variance to be larger than the prior variance.
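The decomposition can be checked by simulation. The sketch below is an illustration under stated assumptions rather than part of the text: it fixes a beta-Bernoulli model with an arbitrary Beta(2, 3) prior and data sets of size N = 10, repeatedly draws θ from the prior and a data set from θ, and compares the averaged posterior moments with the prior moments, in line with (2.21) and (2.24).

```python
import numpy as np

rng = np.random.default_rng(1)
a0, b0, N, trials = 2.0, 3.0, 10, 50_000   # arbitrary prior and data-set size

prior_mean = a0 / (a0 + b0)                               # (2.15)
prior_var = a0 * b0 / ((a0 + b0) ** 2 * (a0 + b0 + 1.0))  # (2.16)

post_means, post_vars = [], []
for _ in range(trials):
    theta = rng.beta(a0, b0)          # parameter drawn from the prior
    m = rng.binomial(N, theta)        # number of x = 1 outcomes in a data set of size N
    a, b = a0 + m, b0 + (N - m)       # beta posterior parameters
    post_means.append(a / (a + b))
    post_vars.append(a * b / ((a + b) ** 2 * (a + b + 1.0)))

post_means = np.array(post_means)
post_vars = np.array(post_vars)

# (2.21): the posterior mean, averaged over data sets, matches the prior mean.
print(prior_mean, post_means.mean())
# (2.24): prior variance = average posterior variance + variance of the posterior mean.
print(prior_var, post_vars.mean() + post_means.var())
```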
2.2. Multinomial Variables

Binary variables can be used to describe quantities that can take one of two possible values. Often, however, we encounter discrete variables that can take on one of K possible mutually exclusive states. Although there are various alternative ways to express such variables, we shall see shortly that a particularly convenient representation is the 1-of-K scheme in which the variable is represented by a K-dimensional vector x in which one of the elements x_k equals 1, and all remaining elements equal 0. So, for instance, if we have a variable that can take K = 6 states and a particular observation of the variable happens to correspond to the state where x_3 = 1, then x will be represented by

x = (0, 0, 1, 0, 0, 0)^T.    (2.25)

Note that such vectors satisfy Σ_{k=1}^K x_k = 1. If we denote the probability of x_k = 1 by the parameter µ_k, then the distribution of x is given by

p(x|µ) = ∏_{k=1}^K µ_k^{x_k}    (2.26)

where µ = (µ_1, ..., µ_K)^T, and the parameters µ_k are constrained to satisfy µ_k ≥ 0 and Σ_k µ_k = 1, because they represent probabilities.
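As a concrete illustration of the 1-of-K scheme and of (2.26), here is a small Python sketch (illustrative helper names, not taken from the book): for a one-hot vector x, the product in (2.26) simply picks out the single parameter µ_k associated with the observed state.

```python
import numpy as np

def one_hot(state, K):
    # 1-of-K encoding: a K-dimensional vector with a single element equal to 1.
    x = np.zeros(K)
    x[state] = 1.0
    return x

def prob(x, mu):
    # p(x | mu) = prod_k mu_k^{x_k} from (2.26); for one-hot x this equals mu_k.
    return np.prod(mu ** x)

mu = np.array([0.1, 0.1, 0.3, 0.2, 0.1, 0.2])  # example parameters, summing to 1
x = one_hot(2, K=6)                            # the K = 6 example of (2.25), x_3 = 1
print(x)                                       # [0. 0. 1. 0. 0. 0.]
print(prob(x, mu))                             # 0.3, i.e. mu_3
```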
The distribution (2.26) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes. It is easily seen that the distribution is normalized,

Σ_x p(x|µ) = Σ_{k=1}^K µ_k = 1    (2.27)

and that

E[x|µ] = Σ_x p(x|µ) x = (µ_1, ..., µ_K)^T = µ.    (2.28)
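Both properties are easy to verify by simulation. The following sketch is illustrative only; it assumes NumPy's multinomial sampler with n = 1 as a convenient way of drawing 1-of-K vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.1, 0.1, 0.3, 0.2, 0.1, 0.2])

# rng.multinomial with n = 1 draws one-hot (1-of-K) vectors, one per row.
X = rng.multinomial(1, mu, size=100_000)

print(np.unique(X.sum(axis=1)))  # [1]: every sample satisfies the sum constraint
print(X.mean(axis=0))            # approaches mu, illustrating E[x|mu] = mu in (2.28)
```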
Now consider a data set D of N independent observations x_1, ..., x_N. The corresponding likelihood function takes the form

p(D|µ) = ∏_{n=1}^N ∏_{k=1}^K µ_k^{x_{nk}} = ∏_{k=1}^K µ_k^{(Σ_n x_{nk})} = ∏_{k=1}^K µ_k^{m_k}.    (2.29)

We see that the likelihood function depends on the N data points only through the K quantities

m_k = Σ_n x_{nk}    (2.30)

which represent the number of observations of x_k = 1. These are called the sufficient statistics for this distribution (see Section 2.4).

In order to find the maximum likelihood solution for µ, we need to maximize ln p(D|µ) with respect to µ_k taking account of the constraint that the µ_k must sum to one. This can be achieved using a Lagrange multiplier λ (see Appendix E) and maximizing

Σ_{k=1}^K m_k ln µ_k + λ (Σ_{k=1}^K µ_k − 1).    (2.31)

Setting the derivative of (2.31) with respect to µ_k to zero, we obtain

µ_k = −m_k / λ.    (2.32)
We can solve for the Lagrange multiplier λ by substituting (2.32) into the constraint Σ_k µ_k = 1 to give λ = −N. Thus we obtain the maximum likelihood solution in the form

µ_k^{ML} = m_k / N    (2.33)

which is the fraction of the N observations for which x_k = 1.
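The maximum likelihood solution therefore amounts to counting. The sketch below (illustrative Python on simulated data; the variable names are not from the book) computes the sufficient statistics (2.30) and the estimate (2.33).

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([0.1, 0.1, 0.3, 0.2, 0.1, 0.2])
N = 10_000

# A data set of N 1-of-K observations, stored as an N x K binary matrix.
X = rng.multinomial(1, mu_true, size=N)

m = X.sum(axis=0)   # sufficient statistics m_k from (2.30)
mu_ml = m / N       # maximum likelihood solution (2.33)
print(m)
print(mu_ml)        # close to mu_true for large N
```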
We can consider the joint distribution of the quantities m_1, ..., m_K, conditioned on the parameters µ and on the total number N of observations. From (2.29) this takes the form

Mult(m_1, m_2, ..., m_K|µ, N) = (N choose m_1, m_2, ..., m_K) ∏_{k=1}^K µ_k^{m_k}    (2.34)

which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning N objects into K groups of size m_1, ..., m_K and is given by

(N choose m_1, m_2, ..., m_K) = N! / (m_1! m_2! ... m_K!).    (2.35)

Note that the variables m_k are subject to the constraint

Σ_{k=1}^K m_k = N.    (2.36)
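For completeness, (2.34) and (2.35) can be evaluated directly with standard-library arithmetic. The function below is an illustrative sketch rather than a library routine.

```python
from math import factorial, prod

def multinomial_pmf(m, mu, N):
    # Mult(m_1, ..., m_K | mu, N) from (2.34), with the coefficient (2.35).
    assert sum(m) == N                 # the constraint (2.36)
    coeff = factorial(N)
    for m_k in m:
        coeff //= factorial(m_k)       # N! / (m_1! m_2! ... m_K!), exact at every step
    return coeff * prod(mu_k ** m_k for mu_k, m_k in zip(mu, m))

# Example: N = 6 observations spread over K = 3 states.
print(multinomial_pmf([3, 2, 1], [0.5, 0.3, 0.2], N=6))  # 60 * 0.5**3 * 0.3**2 * 0.2
```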
2.2.1 The Dirichlet distribution

We now introduce a family of prior distributions for the parameters {µ_k} of the multinomial distribution (2.34). By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by

p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k − 1}    (2.37)

where 0 ≤ µ_k ≤ 1 and Σ_k µ_k = 1. Here α_1, ..., α_K are the parameters of the distribution, and α denotes (α_1, ..., α_K)^T. Note that, because of the summation constraint, the distribution over the space of the {µ_k} is confined to a simplex of dimensionality K − 1, as illustrated for K = 3 in Figure 2.4.

Figure 2.4 The Dirichlet distribution over three variables µ_1, µ_2, µ_3 is confined to a simplex (a bounded linear manifold) of the form shown, as a consequence of the constraints 0 ≤ µ_k ≤ 1 and Σ_k µ_k = 1.

The normalized form for this distribution is (Exercise 2.9)

Dir(µ|α) = (Γ(α_0) / (Γ(α_1) ··· Γ(α_K))) ∏_{k=1}^K µ_k^{α_k − 1}    (2.38)

which is called the Dirichlet distribution. Here Γ(x) is the gamma function defined by (1.141) while

α_0 = Σ_{k=1}^K α_k.    (2.39)

Plots of the Dirichlet distribution over the simplex, for various settings of the parameters α_k, are shown in Figure 2.5.

Multiplying the prior (2.38) by the likelihood function (2.34), we obtain the posterior distribution for the parameters {µ_k} in the form

p(µ|D, α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k + m_k − 1}.    (2.40)

We see that the posterior distribution again takes the form of a Dirichlet distribution, confirming that the Dirichlet is indeed a conjugate prior for the multinomial.
This allows us to determine the normalization coefficient by comparison with (2.38) so that

p(µ|D, α) = Dir(µ|α + m)
          = (Γ(α_0 + N) / (Γ(α_1 + m_1) ··· Γ(α_K + m_K))) ∏_{k=1}^K µ_k^{α_k + m_k − 1}    (2.41)

where we have denoted m = (m_1, ..., m_K)^T. As for the case of the binomial distribution with its beta prior, we can interpret the parameters α_k of the Dirichlet prior as an effective number of observations of x_k = 1.
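The conjugate update in (2.41) reduces to adding the observed counts to the prior parameters. The following sketch (illustrative Python; the prior pseudo-counts and the data are arbitrary) performs the update and compares the prior and posterior means of µ, using the fact, not derived in this excerpt, that a Dirichlet with parameters α has mean α/α_0.

```python
import numpy as np

def dirichlet_posterior(alpha, m):
    # Conjugate update for the multinomial likelihood: Dir(mu | alpha + m), as in (2.41).
    return np.asarray(alpha, dtype=float) + np.asarray(m, dtype=float)

alpha = np.array([1.0, 1.0, 1.0])    # prior pseudo-counts for K = 3 states
m = np.array([10, 2, 5])             # observed counts m_k

alpha_post = dirichlet_posterior(alpha, m)
print(alpha_post)                     # [11.  3.  6.]
print(alpha / alpha.sum())            # prior mean of mu: uniform
print(alpha_post / alpha_post.sum())  # posterior mean, pulled towards the observed fractions
```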
Note that two-state quantities can either be represented as binary variables and modelled using the binomial distribution (2.9) or as 1-of-2 variables and modelled using the multinomial distribution (2.34) with K = 2.

Lejeune Dirichlet (1805–1859). Johann Peter Gustav Lejeune Dirichlet was a modest and reserved mathematician who made contributions in number theory, mechanics, and astronomy, and who gave the first rigorous analysis of Fourier series. His family originated from Richelet in Belgium, and the name Lejeune Dirichlet comes from ‘le jeune de Richelet’ (the young person from Richelet). Dirichlet’s first paper, which was published in 1825, brought him instant fame. It concerned Fermat’s last theorem, which claims that there are no positive integer solutions to x^n + y^n = z^n for n > 2. Dirichlet gave a partial proof for the case n = 5, which was sent to Legendre for review, who in turn completed the proof. Later, Dirichlet gave a complete proof for n = 14, although a full proof of Fermat’s last theorem for arbitrary n had to wait until the work of Andrew Wiles in the closing years of the 20th century.

Figure 2.5 Plots of the Dirichlet distribution over three variables, where the two horizontal axes are coordinates in the plane of the simplex and the vertical axis corresponds to the value of the density.