Bishop C.M., Pattern Recognition and Machine Learning (2006)
In this case, the predictive distribution is a Student's t-distribution.

[Figure 3.10: The equivalent kernel k(x, x′) for the Gaussian basis functions in Figure 3.1, shown as a plot of x versus x′, together with three slices through this matrix corresponding to three different values of x. The data set used to generate this kernel comprised 200 values of x equally spaced over the interval (−1, 1).]

3.3.3 Equivalent kernel

The posterior mean solution (3.53) for the linear basis function model has an interesting interpretation that will set the stage for kernel methods, including Gaussian processes (Chapter 6).
If we substitute (3.53) into the expression (3.3), we see that the predictive mean can be written in the form

\[
y(x, m_N) = m_N^T \phi(x) = \beta\, \phi(x)^T S_N \Phi^T t = \sum_{n=1}^{N} \beta\, \phi(x)^T S_N \phi(x_n)\, t_n
\tag{3.60}
\]

where S_N is defined by (3.51). Thus the mean of the predictive distribution at a point x is given by a linear combination of the training set target variables t_n, so that we can write

\[
y(x, m_N) = \sum_{n=1}^{N} k(x, x_n)\, t_n
\tag{3.61}
\]

where the function

\[
k(x, x') = \beta\, \phi(x)^T S_N \phi(x')
\tag{3.62}
\]

is known as the smoother matrix or the equivalent kernel.
Regression functions, such as this, which make predictions by taking linear combinations of the training set target values, are known as linear smoothers. Note that the equivalent kernel depends on the input values x_n from the data set because these appear in the definition of S_N. The equivalent kernel is illustrated for the case of Gaussian basis functions in Figure 3.10, in which the kernel functions k(x, x′) have been plotted as a function of x′ for three different values of x. We see that they are localized around x, and so the mean of the predictive distribution at x, given by y(x, m_N), is obtained by forming a weighted combination of the target values in which data points close to x are given higher weight than points further removed from x.
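As a rough numerical illustration of (3.60)–(3.62), the following sketch builds the posterior quantities S_N and m_N for a set of Gaussian basis functions and evaluates the equivalent kernel. The data set, the basis-function centres and widths, and the values of α and β below are arbitrary choices made for the example; they are not the settings used to produce Figure 3.10.

```python
import numpy as np

# Minimal sketch of the equivalent kernel (3.62) for Gaussian basis functions.
# The data, basis centres, and the values of alpha and beta are illustrative only.
rng = np.random.default_rng(0)

def gaussian_basis(x, centres, s=0.1):
    """Design matrix of Gaussian basis functions, with a constant bias column."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    phi = np.exp(-0.5 * ((x - centres) / s) ** 2)
    return np.hstack([np.ones_like(x), phi])

alpha, beta = 2.0, 25.0                   # prior precision, noise precision (assumed)
x_train = np.linspace(-1, 1, 200)         # 200 equally spaced inputs, cf. Figure 3.10
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, x_train.size)
centres = np.linspace(-1, 1, 9)

Phi = gaussian_basis(x_train, centres)                                     # N x M design matrix
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)     # (3.51)
m_N = beta * S_N @ Phi.T @ t_train                                         # (3.53)

def equivalent_kernel(x, x_prime):
    """k(x, x') = beta * phi(x)^T S_N phi(x'), equation (3.62)."""
    return beta * gaussian_basis(x, centres) @ S_N @ gaussian_basis(x_prime, centres).T

# The predictive mean (3.60) equals the kernel-weighted sum of targets (3.61).
x_star = np.array([0.3])
k = equivalent_kernel(x_star, x_train)                    # weights k(x*, x_n)
print(np.allclose(k @ t_train, gaussian_basis(x_star, centres) @ m_N))  # True
print(k.sum())   # close to 1: the weights approximately sum to one, cf. (3.64) below
```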
Intuitively, it seems reasonable that we should weight local evidence more strongly than distant evidence. Note that this localization property holds not only for the localized Gaussian basis functions but also for the nonlocal polynomial and sigmoidal basis functions, as illustrated in Figure 3.11.

[Figure 3.11: Examples of equivalent kernels k(x, x′) for x = 0 plotted as a function of x′, corresponding (left) to the polynomial basis functions and (right) to the sigmoidal basis functions shown in Figure 3.1. Note that these are localized functions of x′ even though the corresponding basis functions are nonlocal.]

Further insight into the role of the equivalent kernel can be obtained by considering the covariance between y(x) and y(x′), which is given by

\[
\operatorname{cov}[y(x), y(x')] = \operatorname{cov}[\phi(x)^T w,\; w^T \phi(x')] = \phi(x)^T S_N \phi(x') = \beta^{-1} k(x, x')
\tag{3.63}
\]

where we have made use of (3.49) and (3.62).
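Equation (3.63) can be checked by Monte Carlo: draw samples of w from the posterior N(m_N, S_N) and compare the sample covariance of y(x) = φ(x)^T w with β⁻¹ k(x, x′). The lines below simply continue the earlier sketch, under the same assumed data and settings.

```python
# Continues the previous sketch: check (3.63) by sampling w from the posterior
# N(m_N, S_N) and comparing the sample covariance of y(x) with beta^{-1} k(x, x').
x_grid = np.array([-0.5, 0.0, 0.02, 0.5])
Phi_grid = gaussian_basis(x_grid, centres)

w_samples = rng.multivariate_normal(m_N, S_N, size=20000)   # draws from p(w | t)
y_samples = w_samples @ Phi_grid.T                           # y(x, w) at the grid points

emp_cov = np.cov(y_samples, rowvar=False)                    # empirical covariance
theory = equivalent_kernel(x_grid, x_grid) / beta            # beta^{-1} k(x, x'), (3.63)
print(np.abs(emp_cov - theory).max())                        # small, shrinking with more samples
```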
From the form of the equivalent kernel, we see that the predictive mean at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller. The predictive distribution shown in Figure 3.8 allows us to visualize the pointwise uncertainty in the predictions, governed by (3.59). However, by drawing samples from the posterior distribution over w, and plotting the corresponding model functions y(x, w) as in Figure 3.9, we are visualizing the joint uncertainty in the posterior distribution between the y values at two (or more) x values, as governed by the equivalent kernel.

The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression as follows. Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can instead define a localized kernel directly and use this to make predictions for new input vectors x, given the observed training set.
This leads to a practical framework for regression (and classification) called Gaussian processes, which will be discussed in detail in Section 6.4.

We have seen that the effective kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of x, and it can be shown (Exercise 3.14) that these weights sum to one, in other words

\[
\sum_{n=1}^{N} k(x, x_n) = 1
\tag{3.64}
\]

for all values of x. This intuitively pleasing result can easily be proven informally by noting that the summation is equivalent to considering the predictive mean y(x) for a set of target data in which t_n = 1 for all n. Provided the basis functions are linearly independent, that there are more data points than basis functions, and that one of the basis functions is constant (corresponding to the bias parameter), then it is clear that we can fit the training data exactly and hence that the predictive mean will be simply y(x) = 1, from which we obtain (3.64). Note that the kernel function can be negative as well as positive, so although it satisfies a summation constraint, the corresponding predictions are not necessarily convex combinations of the training set target variables.

Finally, we note that the equivalent kernel (3.62) satisfies an important property shared by kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions, so that

\[
k(x, z) = \psi(x)^T \psi(z)
\tag{3.65}
\]

where ψ(x) = β^{1/2} S_N^{1/2} φ(x).
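The inner-product form (3.65) can be verified numerically with a matrix square root of S_N. The short continuation below reuses the quantities from the earlier sketch and is, again, illustrative only.

```python
# Continues the earlier sketch: express k(x, z) as an inner product (3.65) using
# psi(x) = beta^{1/2} S_N^{1/2} phi(x), with S_N^{1/2} from an eigendecomposition
# of the symmetric positive-definite matrix S_N.
eigvals, eigvecs = np.linalg.eigh(S_N)
S_N_sqrt = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

def psi(x):
    return np.sqrt(beta) * gaussian_basis(x, centres) @ S_N_sqrt

x_a, x_b = np.array([0.2]), np.array([-0.7])
print(np.allclose(psi(x_a) @ psi(x_b).T, equivalent_kernel(x_a, x_b)))   # True
```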
3.4. Bayesian Model Comparison

In Chapter 1 (Section 1.5.4), we highlighted the problem of over-fitting as well as the use of cross-validation as a technique for setting the values of regularization parameters or for choosing between alternative models. Here we consider the problem of model selection from a Bayesian perspective.
In this section, our discussion will be very general, and then in Section 3.5 we shall see how these ideas can be applied to the determination of regularization parameters in linear regression.

As we shall see, the over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data, without the need for a validation set. This allows all available data to be used for training and avoids the multiple training runs for each model associated with cross-validation.
It also allows multiple complexity parameters to be determined simultaneously as part of the training process. For example, in Chapter 7 we shall introduce the relevance vector machine, which is a Bayesian model having one complexity parameter for every training data point.

The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model, along with a consistent application of the sum and product rules of probability. Suppose we wish to compare a set of L models {M_i} where i = 1, . . . , L.
Here a model refers to a probability distribution over the observed data D. In the case of the polynomial curve-fitting problem, the distribution is defined over the set of target values t, while the set of input values X is assumed to be known. Other types of model define a joint distribution over X and t. We shall suppose that the data is generated from one of these models but we are uncertain which one. Our uncertainty is expressed through a prior probability distribution p(M_i).
Given a training set D, we then wish to evaluate the posterior distribution

\[
p(M_i \mid D) \propto p(M_i)\, p(D \mid M_i).
\tag{3.66}
\]

The prior allows us to express a preference for different models. Let us simply assume that all models are given equal prior probability. The interesting term is the model evidence p(D|M_i), which expresses the preference shown by the data for different models, and we shall examine this term in more detail shortly. The model evidence is sometimes also called the marginal likelihood because it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out.
The ratio of model evidences p(D|Mi )/p(D|Mj ) for two modelsis known as a Bayes factor (Kass and Raftery, 1995).Once we know the posterior distribution over models, the predictive distributionis given, from the sum and product rules, byp(t|x, D) =Lp(t|x, Mi , D)p(Mi |D).(3.67)i=1This is an example of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions p(t|x, Mi , D) of individualmodels, weighted by the posterior probabilities p(Mi |D) of those models. For instance, if we have two models that are a-posteriori equally likely and one predictsa narrow distribution around t = a while the other predicts a narrow distributionaround t = b, the overall predictive distribution will be a bimodal distribution withmodes at t = a and t = b, not a single model at t = (a + b)/2.A simple approximation to model averaging is to use the single most probablemodel alone to make predictions.
A simple approximation to model averaging is to use the single most probable model alone to make predictions. This is known as model selection.

For a model governed by a set of parameters w, the model evidence is given, from the sum and product rules of probability, by

\[
p(D \mid M_i) = \int p(D \mid w, M_i)\, p(w \mid M_i)\, \mathrm{d}w.
\tag{3.68}
\]

From a sampling perspective (Chapter 11), the marginal likelihood can be viewed as the probability of generating the data set D from a model whose parameters are sampled at random from the prior.
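This sampling view suggests a simple (if usually inefficient) Monte Carlo estimate of (3.68): draw parameter values from the prior and average the resulting likelihoods. The sketch below does this for a made-up one-parameter model (unknown Gaussian mean, known unit noise); the data, prior, and noise level are arbitrary assumptions for illustration.

```python
import numpy as np

# Rough Monte Carlo estimate of the model evidence (3.68): average p(D | w)
# over parameter values w drawn from the prior p(w). Model and data are invented.
rng = np.random.default_rng(1)
D = rng.normal(0.5, 1.0, size=20)                       # toy data set

def log_likelihood(w, data, noise_var=1.0):
    # log p(D | w) for a Gaussian model with unknown mean w and known noise variance
    return -0.5 * np.sum((data[None, :] - w[:, None]) ** 2 / noise_var
                         + np.log(2 * np.pi * noise_var), axis=1)

w_prior = rng.normal(0.0, 2.0, size=100_000)            # samples from the prior p(w) = N(0, 2^2)
log_like = log_likelihood(w_prior, D)

# Average of p(D | w) over the prior, computed stably in log space.
log_evidence = np.logaddexp.reduce(log_like) - np.log(w_prior.size)
print(log_evidence)
```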
It is also interesting to note that the evidence is precisely the normalizing term that appears in the denominator in Bayes' theorem when evaluating the posterior distribution over parameters because

\[
p(w \mid D, M_i) = \frac{p(D \mid w, M_i)\, p(w \mid M_i)}{p(D \mid M_i)}.
\tag{3.69}
\]

We can obtain some insight into the model evidence by making a simple approximation to the integral over parameters. Consider first the case of a model having a single parameter w. The posterior distribution over parameters is proportional to p(D|w)p(w), where we omit the dependence on the model M_i to keep the notation uncluttered.
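For a single parameter the integral in (3.68) can also be approximated directly on a grid, which at the same time yields the normalizing constant needed in (3.69). The snippet below continues the toy one-parameter model from the previous sketch, under the same invented assumptions.

```python
# Grid-based view of (3.68) and (3.69) for the toy single-parameter model above:
# the evidence is the area under prior * likelihood, and dividing by it gives a
# normalized posterior over w.
w_grid = np.linspace(-8.0, 8.0, 4001)
dw = w_grid[1] - w_grid[0]

prior_pdf = np.exp(-0.5 * (w_grid / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))  # p(w) = N(0, 2^2)
likelihood = np.exp(log_likelihood(w_grid, D))                               # p(D | w)

evidence = np.sum(likelihood * prior_pdf) * dw           # grid approximation to (3.68)
posterior_pdf = likelihood * prior_pdf / evidence        # normalized as in (3.69)

print(np.log(evidence))            # close to the Monte Carlo estimate above
print(np.sum(posterior_pdf) * dw)  # integrates to (approximately) 1
```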