Bishop, C. M., Pattern Recognition and Machine Learning (2006)
Indeed, it is simply another beta distribution, and its normalization coefficient can therefore be obtained by comparison with (2.13) to give

p(µ|m, l, a, b) = (Γ(m + a + l + b) / (Γ(m + a) Γ(l + b))) µ^{m+a−1} (1 − µ)^{l+b−1}.    (2.18)

We see that the effect of observing a data set of m observations of x = 1 and l observations of x = 0 has been to increase the value of a by m, and the value of b by l, in going from the prior distribution to the posterior distribution. This allows us to provide a simple interpretation of the hyperparameters a and b in the prior as an effective number of observations of x = 1 and x = 0, respectively. Note that a and b need not be integers. Furthermore, the posterior distribution can act as the prior if we subsequently observe additional data.
Figure 2.3 Illustration of one step of sequential Bayesian inference. The prior is given by a beta distribution with parameters a = 2, b = 2, and the likelihood function, given by (2.9) with N = m = 1, corresponds to a single observation of x = 1, so that the posterior is given by a beta distribution with parameters a = 3, b = 2.

To see this, we can imagine taking observations one at a time and, after each observation, updating the current posterior distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. At each stage, the posterior is a beta distribution with some total number of (prior and actual) observed values for x = 1 and x = 0 given by the parameters a and b.
Incorporation of an additional observation of x = 1 simply corresponds to incrementing the value of a by 1, whereas for an observation of x = 0 we increment b by 1. Figure 2.3 illustrates one step in this process.
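To make the update concrete, here is a minimal Python sketch (not from the book; NumPy is assumed, and the helper name update_beta is illustrative). It maintains the pair (a, b) and applies the increment rule one observation at a time, reproducing the a = 2, b = 2 → a = 3, b = 2 step of Figure 2.3.

```python
import numpy as np

def update_beta(a, b, x):
    # One sequential Bayesian step for a Bernoulli observation with a Beta(a, b)
    # posterior: observing x = 1 increments a, observing x = 0 increments b.
    return (a + 1, b) if x == 1 else (a, b + 1)

# Prior of Figure 2.3: a = 2, b = 2; a single observation of x = 1 gives Beta(3, 2).
a, b = 2, 2
a, b = update_beta(a, b, 1)
print(a, b)  # 3 2

# Processing a simulated stream one point at a time gives the same posterior as a
# single batch update with the total counts m (ones) and l (zeros).
rng = np.random.default_rng(0)
for x in rng.binomial(1, 0.7, size=100):
    a, b = update_beta(a, b, x)
print(a, b)  # equals (3 + number of ones, 2 + number of zeros) in the stream
```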
We see that this sequential approach to learning arises naturally when we adopt a Bayesian viewpoint. It is independent of the choice of prior and of the likelihood function and depends only on the assumption of i.i.d. data. Sequential methods make use of observations one at a time, or in small batches, and then discard them before the next observations are used (see Section 2.3.5). They can be used, for example, in real-time learning scenarios where a steady stream of data is arriving, and predictions must be made before all of the data is seen. Because they do not require the whole data set to be stored or loaded into memory, sequential methods are also useful for large data sets. Maximum likelihood methods can also be cast into a sequential framework.

If our goal is to predict, as best we can, the outcome of the next trial, then we must evaluate the predictive distribution of x, given the observed data set D.
From the sum and product rules of probability, this takes the form

p(x = 1|D) = ∫_0^1 p(x = 1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D].    (2.19)

Using the result (2.18) for the posterior distribution p(µ|D), together with the result (2.15) for the mean of the beta distribution, we obtain

p(x = 1|D) = (m + a) / (m + a + l + b)    (2.20)

which has a simple interpretation as the total fraction of observations (both real observations and fictitious prior observations) that correspond to x = 1. Note that in the limit of an infinitely large data set, m, l → ∞, the result (2.20) reduces to the maximum likelihood result (2.8).
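The behaviour of (2.20) is easy to see numerically. The short Python sketch below is illustrative only (the function names are not from the book): with little data the prior pseudo-counts a and b matter, while for large m and l the predictive probability approaches the maximum likelihood fraction m/(m + l).

```python
def predictive_prob(m, l, a=2.0, b=2.0):
    # Bayesian predictive probability p(x = 1 | D) from (2.20), with m observations
    # of x = 1, l observations of x = 0, and a Beta(a, b) prior.
    return (m + a) / (m + a + l + b)

def ml_prob(m, l):
    # Maximum likelihood estimate m / N from (2.8).
    return m / (m + l)

print(predictive_prob(2, 1), ml_prob(2, 1))              # 0.571...  0.666...
print(predictive_prob(2000, 1000), ml_prob(2000, 1000))  # 0.666...  0.666...
```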
As we shall see, it is a very general property that the Bayesian and maximum likelihood results will agree in the limit of an infinitely large data set. For a finite data set, the posterior mean for µ always lies between the prior mean and the maximum likelihood estimate for µ corresponding to the relative frequencies of events given by (2.7) (Exercise 2.7).

From Figure 2.2, we see that as the number of observations increases, so the posterior distribution becomes more sharply peaked.
This can also be seen from the result (2.16) for the variance of the beta distribution, in which we see that the variance goes to zero for a → ∞ or b → ∞. In fact, we might wonder whether it is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution will steadily decrease.

To address this, we can take a frequentist view of Bayesian learning and show that, on average, such a property does indeed hold.
Consider a general Bayesian inference problem for a parameter θ for which we have observed a data set D, described by the joint distribution p(θ, D). The following result (Exercise 2.8)

E_θ[θ] = E_D[E_θ[θ|D]]    (2.21)

where

E_θ[θ] ≡ ∫ p(θ) θ dθ    (2.22)

E_D[E_θ[θ|D]] ≡ ∫ { ∫ θ p(θ|D) dθ } p(D) dD    (2.23)

says that the posterior mean of θ, averaged over the distribution generating the data, is equal to the prior mean of θ. Similarly, we can show that

var_θ[θ] = E_D[var_θ[θ|D]] + var_D[E_θ[θ|D]].    (2.24)

The term on the left-hand side of (2.24) is the prior variance of θ. On the right-hand side, the first term is the average posterior variance of θ, and the second term measures the variance in the posterior mean of θ. Because this variance is a positive quantity, this result shows that, on average, the posterior variance of θ is smaller than the prior variance.
The reduction in variance is greater if the variance in the posterior mean is greater. Note, however, that this result only holds on average, and that for a particular observed data set it is possible for the posterior variance to be larger than the prior variance.
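The decomposition can be checked by simulation. The sketch below is an illustration under stated assumptions rather than part of the text: it fixes a beta-Bernoulli model with an arbitrary Beta(2, 3) prior and data sets of size N = 10, repeatedly draws θ from the prior and a data set from θ, and compares the averaged posterior moments with the prior moments, in line with (2.21) and (2.24).

```python
import numpy as np

rng = np.random.default_rng(1)
a0, b0, N, trials = 2.0, 3.0, 10, 50_000   # arbitrary prior and data-set size

prior_mean = a0 / (a0 + b0)                               # (2.15)
prior_var = a0 * b0 / ((a0 + b0) ** 2 * (a0 + b0 + 1.0))  # (2.16)

post_means, post_vars = [], []
for _ in range(trials):
    theta = rng.beta(a0, b0)          # parameter drawn from the prior
    m = rng.binomial(N, theta)        # number of x = 1 outcomes in a data set of size N
    a, b = a0 + m, b0 + (N - m)       # beta posterior parameters
    post_means.append(a / (a + b))
    post_vars.append(a * b / ((a + b) ** 2 * (a + b + 1.0)))

post_means = np.array(post_means)
post_vars = np.array(post_vars)

# (2.21): the posterior mean, averaged over data sets, matches the prior mean.
print(prior_mean, post_means.mean())
# (2.24): prior variance = average posterior variance + variance of the posterior mean.
print(prior_var, post_vars.mean() + post_means.var())
```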
2.2. Multinomial Variables

Binary variables can be used to describe quantities that can take one of two possible values. Often, however, we encounter discrete variables that can take on one of K possible mutually exclusive states. Although there are various alternative ways to express such variables, we shall see shortly that a particularly convenient representation is the 1-of-K scheme in which the variable is represented by a K-dimensional vector x in which one of the elements x_k equals 1, and all remaining elements equal 0. So, for instance, if we have a variable that can take K = 6 states and a particular observation of the variable happens to correspond to the state where x_3 = 1, then x will be represented by

x = (0, 0, 1, 0, 0, 0)^T.    (2.25)

Note that such vectors satisfy Σ_{k=1}^K x_k = 1. If we denote the probability of x_k = 1 by the parameter µ_k, then the distribution of x is given by

p(x|µ) = ∏_{k=1}^K µ_k^{x_k}    (2.26)

where µ = (µ_1, ..., µ_K)^T, and the parameters µ_k are constrained to satisfy µ_k ≥ 0 and Σ_k µ_k = 1, because they represent probabilities.
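As a concrete illustration of the 1-of-K scheme and of (2.26), here is a small Python sketch (illustrative helper names, not taken from the book): for a one-hot vector x, the product in (2.26) simply picks out the single parameter µ_k associated with the observed state.

```python
import numpy as np

def one_hot(state, K):
    # 1-of-K encoding: a K-dimensional vector with a single element equal to 1.
    x = np.zeros(K)
    x[state] = 1.0
    return x

def prob(x, mu):
    # p(x | mu) = prod_k mu_k^{x_k} from (2.26); for one-hot x this equals mu_k.
    return np.prod(mu ** x)

mu = np.array([0.1, 0.1, 0.3, 0.2, 0.1, 0.2])  # example parameters, summing to 1
x = one_hot(2, K=6)                            # the K = 6 example of (2.25), x_3 = 1
print(x)                                       # [0. 0. 1. 0. 0. 0.]
print(prob(x, mu))                             # 0.3, i.e. mu_3
```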
The distribution (2.26) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes. It is easily seen that the distribution is normalized,

Σ_x p(x|µ) = Σ_{k=1}^K µ_k = 1    (2.27)

and that

E[x|µ] = Σ_x p(x|µ) x = (µ_1, ..., µ_K)^T = µ.    (2.28)
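Both properties are easy to verify by simulation. The following sketch is illustrative only; it assumes NumPy's multinomial sampler with n = 1 as a convenient way of drawing 1-of-K vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.1, 0.1, 0.3, 0.2, 0.1, 0.2])

# rng.multinomial with n = 1 draws one-hot (1-of-K) vectors, one per row.
X = rng.multinomial(1, mu, size=100_000)

print(np.unique(X.sum(axis=1)))  # [1]: every sample satisfies the sum constraint
print(X.mean(axis=0))            # approaches mu, illustrating E[x|mu] = mu in (2.28)
```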
Now consider a data set D of N independent observations x_1, ..., x_N. The corresponding likelihood function takes the form

p(D|µ) = ∏_{n=1}^N ∏_{k=1}^K µ_k^{x_{nk}} = ∏_{k=1}^K µ_k^{(Σ_n x_{nk})} = ∏_{k=1}^K µ_k^{m_k}.    (2.29)

We see that the likelihood function depends on the N data points only through the K quantities

m_k = Σ_n x_{nk}    (2.30)

which represent the number of observations of x_k = 1. These are called the sufficient statistics for this distribution (see Section 2.4).

In order to find the maximum likelihood solution for µ, we need to maximize ln p(D|µ) with respect to µ_k taking account of the constraint that the µ_k must sum to one. This can be achieved using a Lagrange multiplier λ (see Appendix E) and maximizing

Σ_{k=1}^K m_k ln µ_k + λ (Σ_{k=1}^K µ_k − 1).    (2.31)

Setting the derivative of (2.31) with respect to µ_k to zero, we obtain

µ_k = −m_k / λ.    (2.32)
We can solve for the Lagrange multiplier λ by substituting (2.32) into the constraint Σ_k µ_k = 1 to give λ = −N. Thus we obtain the maximum likelihood solution in the form

µ_k^{ML} = m_k / N    (2.33)

which is the fraction of the N observations for which x_k = 1.
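The maximum likelihood solution therefore amounts to counting. The sketch below (illustrative Python on simulated data; the variable names are not from the book) computes the sufficient statistics (2.30) and the estimate (2.33).

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([0.1, 0.1, 0.3, 0.2, 0.1, 0.2])
N = 10_000

# A data set of N 1-of-K observations, stored as an N x K binary matrix.
X = rng.multinomial(1, mu_true, size=N)

m = X.sum(axis=0)   # sufficient statistics m_k from (2.30)
mu_ml = m / N       # maximum likelihood solution (2.33)
print(m)
print(mu_ml)        # close to mu_true for large N
```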
We can consider the joint distribution of the quantities m_1, ..., m_K, conditioned on the parameters µ and on the total number N of observations. From (2.29) this takes the form

Mult(m_1, m_2, ..., m_K|µ, N) = (N choose m_1, m_2, ..., m_K) ∏_{k=1}^K µ_k^{m_k}    (2.34)

which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning N objects into K groups of size m_1, ..., m_K and is given by

(N choose m_1, m_2, ..., m_K) = N! / (m_1! m_2! ... m_K!).    (2.35)

Note that the variables m_k are subject to the constraint

Σ_{k=1}^K m_k = N.    (2.36)
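For completeness, (2.34) and (2.35) can be evaluated directly with standard-library arithmetic. The function below is an illustrative sketch rather than a library routine.

```python
from math import factorial, prod

def multinomial_pmf(m, mu, N):
    # Mult(m_1, ..., m_K | mu, N) from (2.34), with the coefficient (2.35).
    assert sum(m) == N                 # the constraint (2.36)
    coeff = factorial(N)
    for m_k in m:
        coeff //= factorial(m_k)       # N! / (m_1! m_2! ... m_K!), exact at every step
    return coeff * prod(mu_k ** m_k for mu_k, m_k in zip(mu, m))

# Example: N = 6 observations spread over K = 3 states.
print(multinomial_pmf([3, 2, 1], [0.5, 0.3, 0.2], N=6))  # 60 * 0.5**3 * 0.3**2 * 0.2
```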
2.2.1 The Dirichlet distribution

We now introduce a family of prior distributions for the parameters {µ_k} of the multinomial distribution (2.34). By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by

p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k − 1}    (2.37)

where 0 ≤ µ_k ≤ 1 and Σ_k µ_k = 1. Here α_1, ..., α_K are the parameters of the distribution, and α denotes (α_1, ..., α_K)^T. Note that, because of the summation constraint, the distribution over the space of the {µ_k} is confined to a simplex of dimensionality K − 1, as illustrated for K = 3 in Figure 2.4.

Figure 2.4 The Dirichlet distribution over three variables µ_1, µ_2, µ_3 is confined to a simplex (a bounded linear manifold) of the form shown, as a consequence of the constraints 0 ≤ µ_k ≤ 1 and Σ_k µ_k = 1.

The normalized form for this distribution is (Exercise 2.9)

Dir(µ|α) = (Γ(α_0) / (Γ(α_1) ··· Γ(α_K))) ∏_{k=1}^K µ_k^{α_k − 1}    (2.38)

which is called the Dirichlet distribution. Here Γ(x) is the gamma function defined by (1.141) while

α_0 = Σ_{k=1}^K α_k.    (2.39)

Plots of the Dirichlet distribution over the simplex, for various settings of the parameters α_k, are shown in Figure 2.5.

Multiplying the prior (2.38) by the likelihood function (2.34), we obtain the posterior distribution for the parameters {µ_k} in the form

p(µ|D, α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k + m_k − 1}.    (2.40)

We see that the posterior distribution again takes the form of a Dirichlet distribution, confirming that the Dirichlet is indeed a conjugate prior for the multinomial.
This allows us to determine the normalization coefficient by comparison with (2.38) so that

p(µ|D, α) = Dir(µ|α + m)
          = (Γ(α_0 + N) / (Γ(α_1 + m_1) ··· Γ(α_K + m_K))) ∏_{k=1}^K µ_k^{α_k + m_k − 1}    (2.41)

where we have denoted m = (m_1, ..., m_K)^T. As for the case of the binomial distribution with its beta prior, we can interpret the parameters α_k of the Dirichlet prior as an effective number of observations of x_k = 1.
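The conjugate update in (2.41) reduces to adding the observed counts to the prior parameters. The following sketch (illustrative Python; the prior pseudo-counts and the data are arbitrary) performs the update and compares the prior and posterior means of µ, using the fact, not derived in this excerpt, that a Dirichlet with parameters α has mean α/α_0.

```python
import numpy as np

def dirichlet_posterior(alpha, m):
    # Conjugate update for the multinomial likelihood: Dir(mu | alpha + m), as in (2.41).
    return np.asarray(alpha, dtype=float) + np.asarray(m, dtype=float)

alpha = np.array([1.0, 1.0, 1.0])    # prior pseudo-counts for K = 3 states
m = np.array([10, 2, 5])             # observed counts m_k

alpha_post = dirichlet_posterior(alpha, m)
print(alpha_post)                     # [11.  3.  6.]
print(alpha / alpha.sum())            # prior mean of mu: uniform
print(alpha_post / alpha_post.sum())  # posterior mean, pulled towards the observed fractions
```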
Note that two-state quantities can either be represented as binary variables and modelled using the binomial distribution (2.9) or as 1-of-2 variables and modelled using the multinomial distribution (2.34) with K = 2.

Lejeune Dirichlet (1805–1859). Johann Peter Gustav Lejeune Dirichlet was a modest and reserved mathematician who made contributions in number theory, mechanics, and astronomy, and who gave the first rigorous analysis of Fourier series. His family originated from Richelet in Belgium, and the name Lejeune Dirichlet comes from ‘le jeune de Richelet’ (the young person from Richelet). Dirichlet’s first paper, which was published in 1825, brought him instant fame. It concerned Fermat’s last theorem, which claims that there are no positive integer solutions to x^n + y^n = z^n for n > 2. Dirichlet gave a partial proof for the case n = 5, which was sent to Legendre for review, who in turn completed the proof. Later, Dirichlet gave a complete proof for n = 14, although a full proof of Fermat’s last theorem for arbitrary n had to wait until the work of Andrew Wiles in the closing years of the 20th century.

Figure 2.5 Plots of the Dirichlet distribution over three variables, where the two horizontal axes are coordinates in the plane of the simplex and the vertical axis corresponds to the value of the density.