Using (C.19), the derivative of the log likelihood with respect to µ is given by

\[
\frac{\partial}{\partial \mu} \ln p(X|\mu, \Sigma) = \sum_{n=1}^{N} \Sigma^{-1}(x_n - \mu)
\tag{2.120}
\]

and setting this derivative to zero, we obtain the solution for the maximum likelihood estimate of the mean given by

\[
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n
\tag{2.121}
\]

which is the mean of the observed set of data points. The maximization of (2.118) with respect to Σ is rather more involved (Exercise 2.34).
The simplest approach is to ignore the symmetry constraint and show that the resulting solution is symmetric as required. Alternative derivations of this result, which impose the symmetry and positive definiteness constraints explicitly, can be found in Magnus and Neudecker (1999). The result is as expected and takes the form

\[
\Sigma_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathrm{T}}
\tag{2.122}
\]

which involves µ_ML because this is the result of a joint maximization with respect to µ and Σ. Note that the solution (2.121) for µ_ML does not depend on Σ_ML, and so we can first evaluate µ_ML and then use this to evaluate Σ_ML.

If we evaluate the expectations of the maximum likelihood solutions under the true distribution, we obtain the following results (Exercise 2.35)

\[
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu
\tag{2.123}
\]
\[
\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\,\Sigma.
\tag{2.124}
\]

We see that the expectation of the maximum likelihood estimate for the mean is equal to the true mean. However, the maximum likelihood estimate for the covariance has an expectation that is less than the true value, and hence it is biased.
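As a quick numerical check of (2.121)-(2.124), the following minimal NumPy sketch (the data settings and variable names are my own, not from the text) draws many small data sets from a known Gaussian, computes the maximum likelihood estimates, and averages them to expose the 1/N bias in Σ_ML.

import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
N, trials = 5, 20000

avg_Sigma_ML = np.zeros((2, 2))
for _ in range(trials):
    X = rng.multivariate_normal(mu_true, Sigma_true, size=N)
    mu_ML = X.mean(axis=0)          # sample mean, equation (2.121)
    D = X - mu_ML
    Sigma_ML = D.T @ D / N          # biased ML covariance, equation (2.122)
    avg_Sigma_ML += Sigma_ML / trials

print(avg_Sigma_ML)                 # close to ((N-1)/N) * Sigma_true, as in (2.124)
print((N - 1) / N * Sigma_true)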
We can correct this bias by defining a different estimator Σ̃ given by

\[
\widetilde{\Sigma} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathrm{T}}.
\tag{2.125}
\]

Comparing (2.125) with (2.122) and using (2.124), we see that the expectation of Σ̃ is equal to Σ.

2.3.5 Sequential estimation

Our discussion of the maximum likelihood solution for the parameters of a Gaussian distribution provides a convenient opportunity to give a more general discussion of the topic of sequential estimation for maximum likelihood. Sequential methods allow data points to be processed one at a time and then discarded. They are important for on-line applications, and also where large data sets are involved so that batch processing of all data points at once is infeasible.

Consider the result (2.121) for the maximum likelihood estimator of the mean, which we will denote by µ_ML^(N) when it is based on N observations. If we dissect out the contribution from the final data point x_N, we obtain

\[
\mu_{\mathrm{ML}}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} x_n
= \frac{1}{N} x_N + \frac{1}{N} \sum_{n=1}^{N-1} x_n
= \frac{1}{N} x_N + \frac{N-1}{N}\, \mu_{\mathrm{ML}}^{(N-1)}
= \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\bigl(x_N - \mu_{\mathrm{ML}}^{(N-1)}\bigr).
\tag{2.126}
\]

This result has a nice interpretation, as follows.
After observing N − 1 data points we have estimated µ by µ_ML^(N−1). We now observe data point x_N, and we obtain our revised estimate µ_ML^(N) by moving the old estimate a small amount, proportional to 1/N, in the direction of the 'error signal' (x_N − µ_ML^(N−1)). Note that, as N increases, so the contribution from successive data points gets smaller.
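As a concrete illustration, here is a minimal NumPy sketch (the data settings are arbitrary choices of mine) that applies the update (2.126) to a stream of points one at a time and compares the result with the batch sample mean.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.8, scale=0.3, size=1000)   # a stream of observations

mu = 0.0                         # estimate before any data are seen
for n, x_n in enumerate(x, start=1):
    mu += (x_n - mu) / n         # sequential update, equation (2.126)

print(mu, x.mean())              # the two agree up to floating-point round-off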
The result (2.126) will clearly give the same answer as the batch result (2.121) because the two formulae are equivalent. However, we will not always be able to derive a sequential algorithm by this route, and so we seek a more general formulation of sequential learning, which leads us to the Robbins-Monro algorithm. Consider a pair of random variables θ and z governed by a joint distribution p(z, θ). The conditional expectation of z given θ defines a deterministic function f(θ) that is given by

\[
f(\theta) \equiv \mathbb{E}[z\,|\,\theta] = \int z\, p(z|\theta)\,\mathrm{d}z
\tag{2.127}
\]

and is illustrated schematically in Figure 2.10.

[Figure 2.10: A schematic illustration of two correlated random variables z and θ, together with the regression function f(θ) given by the conditional expectation E[z|θ]. The Robbins-Monro algorithm provides a general sequential procedure for finding the root θ⋆ of such functions.]
Functions defined in this way are called regression functions.

Our goal is to find the root θ⋆ at which f(θ⋆) = 0. If we had a large data set of observations of z and θ, then we could model the regression function directly and obtain an estimate of its root. Suppose, however, that we observe values of z one at a time and we wish to find a corresponding sequential estimation scheme for θ⋆. The following general procedure for solving such problems was given by Robbins and Monro (1951).
We shall assume that the conditional variance of z is finite, so that

\[
\mathbb{E}\bigl[(z - f)^2 \,\big|\, \theta\bigr] < \infty
\tag{2.128}
\]

and we shall also, without loss of generality, consider the case where f(θ) > 0 for θ < θ⋆ and f(θ) < 0 for θ > θ⋆, so that the expected correction below pushes θ towards the root. The Robbins-Monro procedure then defines a sequence of successive estimates of the root θ⋆ given by

\[
\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, z\bigl(\theta^{(N-1)}\bigr)
\tag{2.129}
\]

where z(θ^(N)) is an observed value of z when θ takes the value θ^(N).
The coefficients {a_N} represent a sequence of positive numbers that satisfy the conditions

\[
\lim_{N \to \infty} a_N = 0
\tag{2.130}
\]
\[
\sum_{N=1}^{\infty} a_N = \infty
\tag{2.131}
\]
\[
\sum_{N=1}^{\infty} a_N^2 < \infty.
\tag{2.132}
\]

It can then be shown (Robbins and Monro, 1951; Fukunaga, 1990) that the sequence of estimates given by (2.129) does indeed converge to the root with probability one. Note that the first condition (2.130) ensures that the successive corrections decrease in magnitude so that the process can converge to a limiting value. The second condition (2.131) is required to ensure that the algorithm does not converge short of the root, and the third condition (2.132) is needed to ensure that the accumulated noise has finite variance and hence does not spoil convergence.
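A minimal sketch of the Robbins-Monro iteration (2.129), assuming a toy regression function f(θ) = θ⋆ − θ observed in Gaussian noise (so that f is positive below the root and negative above it) and step sizes a_N = 1/N, which satisfy (2.130)-(2.132); none of these specific choices comes from the text.

import numpy as np

rng = np.random.default_rng(2)
theta_star = 2.0                     # the root, unknown to the algorithm

def z_obs(theta):
    # noisy observation whose conditional mean is f(theta) = theta_star - theta
    return (theta_star - theta) + rng.normal(scale=1.0)

theta = 0.0
for n in range(1, 10001):
    a = 1.0 / n                      # positive, decreasing, divergent sum, convergent sum of squares
    theta += a * z_obs(theta)        # Robbins-Monro update, equation (2.129)

print(theta)                         # close to theta_star = 2.0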
Now let us consider how a general maximum likelihood problem can be solved sequentially using the Robbins-Monro algorithm. By definition, the maximum likelihood solution θ_ML is a stationary point of the log likelihood function and hence satisfies

\[
\frac{\partial}{\partial \theta} \left\{ \frac{1}{N} \sum_{n=1}^{N} \ln p(x_n|\theta) \right\} \Bigg|_{\theta_{\mathrm{ML}}} = 0.
\tag{2.133}
\]

Exchanging the derivative and the summation, and taking the limit N → ∞, we have

\[
\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \theta} \ln p(x_n|\theta)
= \mathbb{E}_x\!\left[ \frac{\partial}{\partial \theta} \ln p(x|\theta) \right]
\tag{2.134}
\]

and so we see that finding the maximum likelihood solution corresponds to finding the root of a regression function. We can therefore apply the Robbins-Monro procedure, which now takes the form

\[
\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, \frac{\partial}{\partial \theta^{(N-1)}} \ln p\bigl(x_N|\theta^{(N-1)}\bigr).
\tag{2.135}
\]
As a specific example, we consider once again the sequential estimation of the mean of a Gaussian distribution, in which case the parameter θ^(N) is the estimate µ_ML^(N) of the mean of the Gaussian, and the random variable z is given by

\[
z = \frac{\partial}{\partial \mu_{\mathrm{ML}}} \ln p\bigl(x|\mu_{\mathrm{ML}}, \sigma^2\bigr) = \frac{1}{\sigma^2}\bigl(x - \mu_{\mathrm{ML}}\bigr).
\tag{2.136}
\]

Thus the distribution of z is Gaussian with mean (µ − µ_ML)/σ², as illustrated in Figure 2.11. Substituting (2.136) into (2.135), we obtain the univariate form of (2.126), provided we choose the coefficients a_N to have the form a_N = σ²/N.

[Figure 2.11: In the case of a Gaussian distribution, with θ corresponding to the mean µ, the regression function illustrated in Figure 2.10 takes the form of a straight line, shown in red. Here the random variable z corresponds to the derivative of the log likelihood function and is given by (x − µ_ML)/σ², and its expectation, which defines the regression function, is the straight line (µ − µ_ML)/σ². The root of the regression function corresponds to the maximum likelihood estimator µ_ML.]
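The following sketch (NumPy; the variance, true mean, and initial value are assumptions of mine) runs the update (2.135) with z given by (2.136), using the coefficient σ²/N when processing the N-th point, and confirms that it reproduces the running sample mean of (2.126).

import numpy as np

rng = np.random.default_rng(3)
sigma2 = 0.25                                # known variance of the Gaussian
x = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=500)

mu = 0.0                                     # current estimate of the mean
for N, x_N in enumerate(x, start=1):
    z = (x_N - mu) / sigma2                  # score for the new point, equation (2.136)
    mu += (sigma2 / N) * z                   # update (2.135) with coefficient sigma^2 / N
    # the sigma^2 factors cancel, so this is exactly the update (2.126)

print(mu, x.mean())                          # identical up to round-off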
Note that although we have focussed on the case of a single variable, the same technique, together with the same restrictions (2.130)–(2.132) on the coefficients a_N, applies equally to the multivariate case (Blum, 1965).

2.3.6 Bayesian inference for the Gaussian

The maximum likelihood framework gave point estimates for the parameters µ and Σ. Now we develop a Bayesian treatment by introducing prior distributions over these parameters.
Let us begin with a simple example in which we consider a single Gaussian random variable x. We shall suppose that the variance σ² is known, and we consider the task of inferring the mean µ given a set of N observations X = {x_1, . . . , x_N}. The likelihood function, that is, the probability of the observed data given µ, viewed as a function of µ, is given by

\[
p(X|\mu) = \prod_{n=1}^{N} p(x_n|\mu)
= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}.
\tag{2.137}
\]

Again we emphasize that the likelihood function p(X|µ) is not a probability distribution over µ and is not normalized.

We see that the likelihood function takes the form of the exponential of a quadratic form in µ.
Thus if we choose a prior p(µ) given by a Gaussian, it will be a conjugate distribution for this likelihood function because the corresponding posterior will be a product of two exponentials of quadratic functions of µ and hence will also be Gaussian. We therefore take our prior distribution to be

\[
p(\mu) = \mathcal{N}\bigl(\mu|\mu_0, \sigma_0^2\bigr)
\tag{2.138}
\]

and the posterior distribution is given by

\[
p(\mu|X) \propto p(X|\mu)\, p(\mu).
\tag{2.139}
\]

Simple manipulation involving completing the square in the exponent shows that the posterior distribution is given by (Exercise 2.38)

\[
p(\mu|X) = \mathcal{N}\bigl(\mu|\mu_N, \sigma_N^2\bigr)
\tag{2.140}
\]

where

\[
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}}
\tag{2.141}
\]
\[
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
\tag{2.142}
\]

in which µ_ML is the maximum likelihood solution for µ given by the sample mean

\[
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n.
\tag{2.143}
\]

It is worth spending a moment studying the form of the posterior mean and variance.
First of all, we note that the mean of the posterior distribution given by (2.141) is a compromise between the prior mean µ_0 and the maximum likelihood solution µ_ML. If the number of observed data points N = 0, then (2.141) reduces to the prior mean as expected. For N → ∞, the posterior mean is given by the maximum likelihood solution. Similarly, consider the result (2.142) for the variance of the posterior distribution. We see that this is most naturally expressed in terms of the inverse variance, which is called the precision. Furthermore, the precisions are additive, so that the precision of the posterior is given by the precision of the prior plus one contribution of the data precision from each of the observed data points. As we increase the number of observed data points, the precision steadily increases, corresponding to a posterior distribution with steadily decreasing variance. With no observed data points, we have the prior variance, whereas if the number of data points N → ∞, the variance σ_N² goes to zero and the posterior distribution becomes infinitely peaked around the maximum likelihood solution. We therefore see that the maximum likelihood result of a point estimate for µ given by (2.143) is recovered precisely from the Bayesian formalism in the limit of an infinite number of observations.
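A small numerical illustration of (2.140)-(2.142) in NumPy; the prior, noise variance, and sample size below are my own choices, loosely resembling the setting described in the caption of Figure 2.12 below.

import numpy as np

rng = np.random.default_rng(4)
sigma2 = 0.1                     # known data variance
mu0, sigma02 = 0.0, 0.1          # Gaussian prior over the mean, equation (2.138)
x = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=10)

N = len(x)
mu_ML = x.mean()                                                       # equation (2.143)
mu_N = (sigma2 * mu0 + N * sigma02 * mu_ML) / (N * sigma02 + sigma2)   # equation (2.141)
sigma2_N = 1.0 / (1.0 / sigma02 + N / sigma2)                          # equation (2.142)

print(mu_N, sigma2_N)            # posterior mean lies between mu0 and mu_ML; variance below sigma02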
Note also that for finite N, if we take the limit σ_0² → ∞, in which the prior has infinite variance, then the posterior mean (2.141) reduces to the maximum likelihood result, while from (2.142) the posterior variance is given by σ_N² = σ²/N.

[Figure 2.12: Illustration of Bayesian inference for the mean µ of a Gaussian distribution, in which the variance is assumed to be known. The curves show the prior distribution over µ (the curve labelled N = 0), which in this case is itself Gaussian, along with the posterior distribution given by (2.140) for increasing numbers N of data points (N = 1, 2 and 10). The data points are generated from a Gaussian of mean 0.8 and variance 0.1, and the prior is chosen to have mean 0. In both the prior and the likelihood function, the variance is set to the true value.]

We illustrate our analysis of Bayesian inference for the mean of a Gaussian distribution in Figure 2.12. The generalization of this result to the case of a D-dimensional Gaussian random variable x with known covariance and unknown mean is straightforward (Exercise 2.40).

We have already seen (Section 2.3.5) how the maximum likelihood expression for the mean of a Gaussian can be re-cast as a sequential update formula in which the mean after observing N data points was expressed in terms of the mean after observing N − 1 data points together with the contribution from data point x_N.
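Since the Gaussian prior is conjugate, the posterior (2.140)-(2.142) can likewise be accumulated one observation at a time, with the posterior after N − 1 points acting as the prior for x_N. The sketch below (same kind of assumed settings as the previous example) checks that this sequential view reproduces the batch formula (2.141).

import numpy as np

rng = np.random.default_rng(5)
sigma2 = 0.1                         # known data variance
mu0, sigma02 = 0.0, 0.1              # initial prior over the mean
x = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=10)

mu_n, sigma2_n = mu0, sigma02        # running posterior, initialised at the prior
for x_n in x:
    # (2.141)-(2.142) applied to a single observation, with the current
    # posterior playing the role of the prior for the next data point
    mu_n = (sigma2 * mu_n + sigma2_n * x_n) / (sigma2_n + sigma2)
    sigma2_n = 1.0 / (1.0 / sigma2_n + 1.0 / sigma2)

N = len(x)
mu_batch = (sigma2 * mu0 + N * sigma02 * x.mean()) / (N * sigma02 + sigma2)
print(mu_n, mu_batch)                # agree up to floating-point round-off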