Bishop C.M., Pattern Recognition and Machine Learning (2006)
If we assume that the posterior distribution is sharply peaked around the most probable value $w_{\mathrm{MAP}}$, with width $\Delta w_{\mathrm{posterior}}$, then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width $\Delta w_{\mathrm{prior}}$, so that $p(w) = 1/\Delta w_{\mathrm{prior}}$, then we have

$$
p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, \mathrm{d}w \;\simeq\; p(\mathcal{D} \mid w_{\mathrm{MAP}})\, \frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}.
\tag{3.70}
$$
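Spelling the approximation step out symbolically (this simply restates the argument above; no additional assumptions are introduced): the integrand is replaced by its peak value spread over the width of the peak, and the flat prior contributes $p(w_{\mathrm{MAP}}) = 1/\Delta w_{\mathrm{prior}}$, so

$$
\int p(\mathcal{D} \mid w)\, p(w)\, \mathrm{d}w
\;\approx\; p(\mathcal{D} \mid w_{\mathrm{MAP}})\, p(w_{\mathrm{MAP}})\, \Delta w_{\mathrm{posterior}}
\;=\; p(\mathcal{D} \mid w_{\mathrm{MAP}})\, \frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}.
$$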
Figure 3.12: We can obtain a rough approximation to the model evidence if we assume that the posterior distribution over parameters is sharply peaked around its mode $w_{\mathrm{MAP}}$. (The plot shows, as a function of $w$, the posterior of width $\Delta w_{\mathrm{posterior}}$ centred on $w_{\mathrm{MAP}}$ together with a flat prior of width $\Delta w_{\mathrm{prior}}$.)

Taking logs of (3.70), we obtain

$$
\ln p(\mathcal{D}) \;\simeq\; \ln p(\mathcal{D} \mid w_{\mathrm{MAP}}) + \ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right).
\tag{3.71}
$$

This approximation is illustrated in Figure 3.12. The first term represents the fit to the data given by the most probable parameter values, and for a flat prior this would correspond to the log likelihood.
The second term penalizes the model according to its complexity. Because $\Delta w_{\mathrm{posterior}} < \Delta w_{\mathrm{prior}}$, this term is negative, and it increases in magnitude as the ratio $\Delta w_{\mathrm{posterior}}/\Delta w_{\mathrm{prior}}$ gets smaller. Thus, if parameters are finely tuned to the data in the posterior distribution, then the penalty term is large.

For a model having a set of $M$ parameters, we can make a similar approximation for each parameter in turn. Assuming that all parameters have the same ratio $\Delta w_{\mathrm{posterior}}/\Delta w_{\mathrm{prior}}$, we obtain

$$
\ln p(\mathcal{D}) \;\simeq\; \ln p(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}) + M \ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right).
\tag{3.72}
$$

Thus, in this very simple approximation, the size of the complexity penalty increases linearly with the number $M$ of adaptive parameters in the model.
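To see the trade-off in (3.72) numerically, here is a minimal sketch with made-up numbers: the best-fit log likelihoods and the width ratio below are hypothetical values chosen purely for illustration, not quantities from the text.

```python
import numpy as np

# Minimal sketch of the trade-off in (3.72), using made-up numbers.
# Three hypothetical models of increasing complexity: their best-fit log
# likelihoods improve with M, but each parameter pays the same (negative)
# Occam penalty ln(dw_posterior / dw_prior).

log_lik_map = {1: -120.0, 3: -100.0, 9: -97.0}   # hypothetical ln p(D | w_MAP)
ratio = 0.05                                      # hypothetical dw_posterior / dw_prior

for M, ll in log_lik_map.items():
    log_evidence = ll + M * np.log(ratio)         # equation (3.72)
    print(f"M = {M}: approx ln p(D) = {log_evidence:.1f}")

# With these numbers the intermediate model (M = 3) attains the highest
# approximate evidence: the M = 9 model fits only slightly better but pays
# a much larger complexity penalty.
```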
As we increase the complexity of the model, the first term will typically decrease, because a more complex model is better able to fit the data, whereas the second term will increase due to the dependence on $M$. The optimal model complexity, as determined by the maximum evidence, will be given by a trade-off between these two competing terms. We shall later develop a more refined version of this approximation, based on a Gaussian approximation to the posterior distribution (Section 4.4.1).

We can gain further insight into Bayesian model comparison and understand how the marginal likelihood can favour models of intermediate complexity by considering Figure 3.13.
Here the horizontal axis is a one-dimensional representation of the space of possible data sets, so that each point on this axis corresponds to a specific data set. We now consider three models $\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$ of successively increasing complexity. Imagine running these models generatively to produce example data sets, and then looking at the distribution of data sets that result.

Figure 3.13: Schematic illustration of the distribution of data sets for three models of different complexity, in which $\mathcal{M}_1$ is the simplest and $\mathcal{M}_3$ is the most complex. Note that the distributions are normalized. In this example, for the particular observed data set $\mathcal{D}_0$, the model $\mathcal{M}_2$ with intermediate complexity has the largest evidence. (Vertical axis: $p(\mathcal{D})$; horizontal axis: $\mathcal{D}$, with $\mathcal{D}_0$ marked and curves labelled $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_3$.)

Any given model can generate a variety of different data sets since the parameters are governed by a prior probability distribution, and for any choice of the parameters there may be random noise on the target variables. To generate a particular data set from a specific model, we first choose the values of the parameters from their prior distribution $p(\mathbf{w})$, and then for these parameter values we sample the data from $p(\mathcal{D} \mid \mathbf{w})$. A simple model (for example, based on a first order polynomial) has little variability and so will generate data sets that are fairly similar to each other. Its distribution $p(\mathcal{D})$ is therefore confined to a relatively small region of the horizontal axis.
By contrast, a complex model (such as a ninth order polynomial) can generate a great variety of different data sets, and so its distribution $p(\mathcal{D})$ is spread over a large region of the space of data sets. Because the distributions $p(\mathcal{D} \mid \mathcal{M}_i)$ are normalized, we see that the particular data set $\mathcal{D}_0$ can have the highest value of the evidence for the model of intermediate complexity. Essentially, the simpler model cannot fit the data well, whereas the more complex model spreads its predictive probability over too broad a range of data sets and so assigns relatively small probability to any one of them.

Implicit in the Bayesian model comparison framework is the assumption that the true distribution from which the data are generated is contained within the set of models under consideration. Provided this is so, we can show that Bayesian model comparison will on average favour the correct model.
To see this, consider two models $\mathcal{M}_1$ and $\mathcal{M}_2$ in which the truth corresponds to $\mathcal{M}_1$. For a given finite data set, it is possible for the Bayes factor to be larger for the incorrect model. However, if we average the Bayes factor over the distribution of data sets, we obtain the expected Bayes factor in the form

$$
\int p(\mathcal{D} \mid \mathcal{M}_1) \ln \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)}\, \mathrm{d}\mathcal{D}
\tag{3.73}
$$

where the average has been taken with respect to the true distribution of the data. This quantity is an example of the Kullback-Leibler divergence (Section 1.6.1) and satisfies the property of always being positive unless the two distributions are equal, in which case it is zero. Thus, on average, the Bayes factor will always favour the correct model.

We have seen that the Bayesian framework avoids the problem of over-fitting and allows models to be compared on the basis of the training data alone.
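The intermediate-complexity effect of Figure 3.13 can be checked numerically. The sketch below is my own construction, not the book's example: for a polynomial model that is linear in its parameters $\mathbf{w}$, with prior $\mathbf{w} \sim \mathcal{N}(0, \alpha^{-1} I)$ and Gaussian noise of precision $\beta$, marginalizing over $\mathbf{w}$ gives the closed-form evidence $\mathbf{t} \sim \mathcal{N}(0, \beta^{-1} I + \alpha^{-1} \Phi \Phi^{\top})$, so the log evidence of polynomials of different order can be compared on one synthetic data set. The sinusoidal target, the values of `alpha` and `beta`, and the orders tried are illustrative choices (the first- and ninth-order polynomials echo the examples mentioned in the text).

```python
import numpy as np

# Closed-form log evidence of polynomial models of different order,
# illustrating how a model of intermediate complexity can maximize
# p(D | M_i) for a given data set (cf. Figure 3.13).

rng = np.random.default_rng(0)
N, alpha, beta = 10, 5e-3, 11.1                       # illustrative settings
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=beta ** -0.5, size=N)

def log_evidence(order):
    """ln p(t | M) for a polynomial model of the given order."""
    Phi = np.vander(x, order + 1, increasing=True)    # design matrix
    C = np.eye(N) / beta + Phi @ Phi.T / alpha        # marginal covariance of t
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))

for m in [1, 3, 9]:                                   # first, intermediate, ninth order
    print(f"order {m}: ln p(t) = {log_evidence(m):.2f}")
# A first-order polynomial underfits, while a ninth-order model spreads its
# probability over too broad a range of data sets; an intermediate order
# typically attains the largest evidence for data of this kind.
```

The closed-form Gaussian used here is the standard marginal of a linear-Gaussian model and should agree with the evidence function that the book goes on to derive in Section 3.5.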
However, a Bayesian approach, like any approach to pattern recognition, needs to make assumptions about the form of the model, and if these are invalid then the results can be misleading. In particular, we see from Figure 3.12 that the model evidence can be sensitive to many aspects of the prior, such as the behaviour in the tails. Indeed, the evidence is not defined if the prior is improper, as can be seen by noting that an improper prior has an arbitrary scaling factor (in other words, the normalization coefficient is not defined because the distribution cannot be normalized). If we consider a proper prior and then take a suitable limit in order to obtain an improper prior (for example, a Gaussian prior in which we take the limit of infinite variance), then the evidence will go to zero, as can be seen from (3.70) and Figure 3.12.
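This limit can be checked with a small self-contained example (my own construction, not the book's): a one-parameter conjugate model $y_i = w + \varepsilon_i$ with known noise variance and a zero-mean Gaussian prior on $w$ whose variance we let grow. The marginal likelihood is Gaussian and can be evaluated exactly, and it shrinks toward zero as the prior broadens, consistent with (3.70) since $\Delta w_{\mathrm{prior}}$ grows while $\Delta w_{\mathrm{posterior}}$ stays bounded.

```python
import numpy as np

# Model: y_i = w + eps_i, eps_i ~ N(0, sigma2), prior w ~ N(0, tau2).
# Marginally, y ~ N(0, sigma2*I + tau2*11^T), evaluated directly below.

rng = np.random.default_rng(1)
sigma2, n = 0.5 ** 2, 20
y = 1.0 + rng.normal(scale=np.sqrt(sigma2), size=n)   # true w = 1.0 (illustrative)

def log_evidence(tau2):
    C = sigma2 * np.eye(n) + tau2 * np.ones((n, n))    # marginal covariance
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(C, y))

for tau2 in [1e0, 1e2, 1e4, 1e6]:
    print(f"prior variance {tau2:9.0e}: ln p(y) = {log_evidence(tau2):.2f}")
# For large tau2 the log evidence falls by roughly 0.5*ln(100) ~ 2.3 per
# hundred-fold increase in the prior variance, so the evidence itself tends
# to zero as the prior becomes improper, as stated in the text.
```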
It may, however, be possible to consider the evidence ratio between two models first and then take a limit to obtain a meaningful answer.

In a practical application, therefore, it will be wise to keep aside an independent test set of data on which to evaluate the overall performance of the final system.
3.5. The Evidence Approximation

In a fully Bayesian treatment of the linear basis function model, we would introduce prior distributions over the hyperparameters $\alpha$ and $\beta$ and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters $\mathbf{w}$. However, although we can integrate analytically over either $\mathbf{w}$ or over the hyperparameters, the complete marginalization over all of these variables is analytically intractable.
Here we discuss an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters $\mathbf{w}$. This framework is known in the statistics literature as empirical Bayes (Bernardo and Smith, 1994; Gelman et al., 2004), or type 2 maximum likelihood (Berger, 1985), or generalized maximum likelihood (Wahba, 1975), and in the machine learning literature is also called the evidence approximation (Gull, 1989; MacKay, 1992a).

If we introduce hyperpriors over $\alpha$ and $\beta$, the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$ so that

$$
p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, \mathrm{d}\mathbf{w}\, \mathrm{d}\alpha\, \mathrm{d}\beta
\tag{3.74}
$$

where $p(t \mid \mathbf{w}, \beta)$ is given by (3.8) and $p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)$ is given by (3.49) with $\mathbf{m}_N$ and $\mathbf{S}_N$ defined by (3.53) and (3.54) respectively.
Here we have omitted the dependence on the input variable $x$ to keep the notation uncluttered. If the posterior distribution $p(\alpha, \beta \mid \mathbf{t})$ is sharply peaked around values $\widehat{\alpha}$ and $\widehat{\beta}$, then the predictive distribution is obtained simply by marginalizing over $\mathbf{w}$ in which $\alpha$ and $\beta$ are fixed to the values $\widehat{\alpha}$ and $\widehat{\beta}$, so that

$$
p(t \mid \mathbf{t}) \;\simeq\; p(t \mid \mathbf{t}, \widehat{\alpha}, \widehat{\beta}) = \int p(t \mid \mathbf{w}, \widehat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \widehat{\alpha}, \widehat{\beta})\, \mathrm{d}\mathbf{w}.
\tag{3.75}
$$

From Bayes' theorem, the posterior distribution for $\alpha$ and $\beta$ is given by

$$
p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta).
\tag{3.76}
$$

If the prior is relatively flat, then in the evidence framework the values of $\widehat{\alpha}$ and $\widehat{\beta}$ are obtained by maximizing the marginal likelihood function $p(\mathbf{t} \mid \alpha, \beta)$.
We shall proceed by evaluating the marginal likelihood for the linear basis function model and then finding its maxima. This will allow us to determine values for these hyperparameters from the training data alone, without recourse to cross-validation. Recall that the ratio $\alpha/\beta$ is analogous to a regularization parameter.

As an aside, it is worth noting that, if we define conjugate (Gamma) prior distributions over $\alpha$ and $\beta$, then the marginalization over these hyperparameters in (3.74) can be performed analytically to give a Student's t-distribution over $\mathbf{w}$ (see Section 2.3.7).
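As a minimal sketch of the evidence framework just described (my own construction; the analytic maximization that the book develops next is not reproduced here), one can choose $\widehat{\alpha}$ and $\widehat{\beta}$ by directly maximizing the closed-form log marginal likelihood of a linear basis function model, since marginalizing over $\mathbf{w}$ gives $\mathbf{t} \sim \mathcal{N}(0, \beta^{-1} I + \alpha^{-1} \Phi \Phi^{\top})$. The Gaussian basis functions, the synthetic data, and the search grid below are illustrative choices only.

```python
import numpy as np

# Evidence approximation sketch: pick alpha, beta by maximizing the log
# marginal likelihood ln p(t | alpha, beta) of a linear basis function model,
# where marginalizing over w in t = Phi w + noise (w ~ N(0, alpha^{-1} I),
# noise precision beta) gives t ~ N(0, beta^{-1} I + alpha^{-1} Phi Phi^T).

rng = np.random.default_rng(2)
N = 30
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)

centres = np.linspace(0.0, 1.0, 9)
Phi = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / 0.1) ** 2)  # Gaussian basis

def log_marginal(alpha, beta):
    C = np.eye(N) / beta + Phi @ Phi.T / alpha
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))

# Simple grid search over (alpha, beta); a gradient-based optimizer or the
# analytic re-estimation developed in the text could replace this.
grid = np.logspace(-3, 2, 26)
alpha_hat, beta_hat = max(
    ((a, b) for a in grid for b in grid), key=lambda ab: log_marginal(*ab)
)
print(f"alpha_hat = {alpha_hat:.3g}, beta_hat = {beta_hat:.3g}")
# These values can then be fixed in the posterior over w, (3.49), and used to
# make predictions via (3.75), with no recourse to cross-validation.
```

The grid search is only a stand-in: the point of the evidence framework is that $p(\mathbf{t} \mid \alpha, \beta)$ can be evaluated and maximized using the training data alone, which is exactly what the following subsections of the book go on to do analytically.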