Bishop C.M., Pattern Recognition and Machine Learning (2006)
In practice, for anything other than small N, this bias will not prove to be a serious problem. However, throughout this book we shall be interested in more complex models with many parameters, for which the bias problems associated with maximum likelihood will be much more severe. In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.

1.2.5 Curve fitting re-visited

We have seen (Section 1.1) how the problem of polynomial curve fitting can be expressed in terms of error minimization.
Here we return to the curve fitting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment.

The goal in the curve fitting problem is to be able to make predictions for the target variable t given some new value of the input variable x, on the basis of a set of training data comprising N input values x = (x_1, ..., x_N)^T and their corresponding target values t = (t_1, ..., t_N)^T. We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x, w) of the polynomial curve given by (1.1).
Thus we have

\[ p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\bigl(t \mid y(x, \mathbf{w}),\, \beta^{-1}\bigr) \qquad (1.60) \]

where, for consistency with the notation in later chapters, we have defined a precision parameter β corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16.

[Figure 1.16: Schematic illustration of the Gaussian conditional distribution for t given x defined by (1.60), in which the mean is given by the polynomial function y(x, w), and the precision is given by the parameter β, which is related to the variance by β^{-1} = σ².]
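To make the noise model (1.60) concrete, the following minimal sketch draws a synthetic data set in the spirit of the sinusoidal example from Section 1.1. The code is in Python/NumPy, which the book itself does not use; the underlying curve sin(2πx), the sample size N, and the noise precision are illustrative choices rather than values prescribed by the text.

import numpy as np

# Sample targets from (1.60): each t_n is Gaussian with mean y(x_n, w) and variance 1/beta.
# Here the "true" curve is sin(2*pi*x); all numerical values are illustrative.
rng = np.random.default_rng(0)

N = 10                # number of training points (illustrative)
beta_true = 11.1      # noise precision, i.e. noise variance 1/beta ~ 0.09

x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=np.sqrt(1.0 / beta_true), size=N)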
We now use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood. If the data are assumed to be drawn independently from the distribution (1.60), then the likelihood function is given by

\[ p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\bigr). \qquad (1.61) \]

As we did in the case of the simple Gaussian distribution earlier, it is convenient to maximize the logarithm of the likelihood function. Substituting for the form of the Gaussian distribution, given by (1.46), we obtain the log likelihood function in the form

\[ \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi). \qquad (1.62) \]

Consider first the determination of the maximum likelihood solution for the polynomial coefficients, which will be denoted by w_ML.
These are determined by maximizing (1.62) with respect to w. For this purpose, we can omit the last two terms on the right-hand side of (1.62) because they do not depend on w. Also, we note that scaling the log likelihood by a positive constant coefficient does not alter the location of the maximum with respect to w, and so we can replace the coefficient β/2 with 1/2. Finally, instead of maximizing the log likelihood, we can equivalently minimize the negative log likelihood.
We therefore see that maximizing likelihood is equivalent, so far as determining w is concerned, to minimizing the sum-of-squares error function defined by (1.2). Thus the sum-of-squares error function has arisen as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution.

We can also use maximum likelihood to determine the precision parameter β of the Gaussian conditional distribution. Maximizing (1.62) with respect to β gives

\[ \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}_{ML}) - t_n\}^2. \qquad (1.63) \]

Again we can first determine the parameter vector w_ML governing the mean and subsequently use this to find the precision β_ML, as was the case for the simple Gaussian distribution (Section 1.2.4).
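This two-stage fit is short to write down in code. The sketch below is illustrative rather than taken from the book: it reuses the synthetic arrays x and t generated earlier, obtains w_ML by ordinary least squares for a polynomial of order M (the choice M = 3 is arbitrary), and then evaluates β_ML from (1.63).

# Maximum likelihood fit of the polynomial coefficients and the noise precision.
M = 3  # polynomial order (illustrative choice)

# Design matrix with elements Phi[n, i] = x_n ** i, for i = 0, ..., M.
Phi = np.vander(x, M + 1, increasing=True)

# w_ML minimizes the sum-of-squares error, which is equivalent to maximizing (1.62) in w.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# beta_ML from (1.63): the inverse of the mean squared residual.
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)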
Having determined the parameters w and β, we can now make predictions for new values of x. Because we now have a probabilistic model, these are expressed in terms of the predictive distribution that gives the probability distribution over t, rather than simply a point estimate, and is obtained by substituting the maximum likelihood parameters into (1.60) to give

\[ p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\bigl(t \mid y(x, \mathbf{w}_{ML}),\, \beta_{ML}^{-1}\bigr). \qquad (1.64) \]
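Continuing the illustrative sketch above (again not code from the book, and reusing M, w_ml, and beta_ml), the maximum likelihood predictive distribution (1.64) at new inputs is simply a Gaussian centred on the fitted polynomial with constant variance 1/β_ML:

# Predictive distribution (1.64) at new inputs: mean y(x, w_ML), variance 1/beta_ML.
x_new = np.linspace(0.0, 1.0, 101)
Phi_new = np.vander(x_new, M + 1, increasing=True)

pred_mean = Phi_new @ w_ml
pred_std = np.full_like(pred_mean, np.sqrt(1.0 / beta_ml))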
Now let us take a step towards a more Bayesian approach and introduce a prior distribution over the polynomial coefficients w. For simplicity, let us consider a Gaussian distribution of the form

\[ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}\right\} \qquad (1.65) \]

where α is the precision of the distribution, and M + 1 is the total number of elements in the vector w for an Mth order polynomial. Variables such as α, which control the distribution of model parameters, are called hyperparameters. Using Bayes' theorem, the posterior distribution for w is proportional to the product of the prior distribution and the likelihood function

\[ p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha). \qquad (1.66) \]

We can now determine w by finding the most probable value of w given the data, in other words by maximizing the posterior distribution.
This technique is called maximum posterior, or simply MAP. Taking the negative logarithm of (1.66) and combining with (1.62) and (1.65), we find that the maximum of the posterior is given by the minimum of

\[ \frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}. \qquad (1.67) \]

Thus we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered earlier in the form (1.4), with a regularization parameter given by λ = α/β.
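Because (1.67) is quadratic in w, the MAP estimate has a closed form, namely regularized (ridge) least squares with λ = α/β. The sketch below is illustrative code, not from the book; it reuses the design matrix Phi and the noise precision beta_true from the earlier sketches, and the value of α is borrowed from the Figure 1.17 caption purely for illustration.

# MAP estimate of w: minimize (1.67), i.e. regularized least squares with lambda = alpha/beta.
alpha = 5e-3                 # prior precision (the value quoted in Figure 1.17; illustrative here)
lam = alpha / beta_true      # regularization parameter lambda = alpha / beta

# Setting the gradient of (1.67) to zero gives (Phi^T Phi + lam*I) w = Phi^T t.
A = Phi.T @ Phi + lam * np.eye(M + 1)
w_map = np.linalg.solve(A, Phi.T @ t)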
1.2.6 Bayesian curve fitting

Although we have included a prior distribution p(w|α), we are so far still making a point estimate of w and so this does not yet amount to a Bayesian treatment. In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we shall see shortly, that we integrate over all values of w. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.

In the curve fitting problem, we are given the training data x and t, along with a new test point x, and our goal is to predict the value of t. We therefore wish to evaluate the predictive distribution p(t | x, x, t). Here we shall assume that the parameters α and β are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).

A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form

\[ p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}. \qquad (1.68) \]

Here p(t | x, w) is given by (1.60), and we have omitted the dependence on α and β to simplify the notation.
Here p(w | x, t) is the posterior distribution over parameters, and can be found by normalizing the right-hand side of (1.66). We shall see in Section 3.3 that, for problems such as the curve-fitting example, this posterior distribution is a Gaussian and can be evaluated analytically. Similarly, the integration in (1.68) can also be performed analytically, with the result that the predictive distribution is given by a Gaussian of the form

\[ p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\bigl(t \mid m(x),\, s^2(x)\bigr) \qquad (1.69) \]

where the mean and variance are given by

\[ m(x) = \beta\, \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S} \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, t_n \qquad (1.70) \]

\[ s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S}\, \boldsymbol{\phi}(x). \qquad (1.71) \]

Here the matrix S is given by

\[ \mathbf{S}^{-1} = \alpha\mathbf{I} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, \boldsymbol{\phi}(x_n)^{\mathrm{T}} \qquad (1.72) \]

where I is the unit matrix, and we have defined the vector φ(x) with elements φ_i(x) = x^i for i = 0, ..., M.
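These formulas translate directly into code. Continuing the earlier illustrative sketches (not code from the book, and reusing Phi, Phi_new, alpha, and beta_true defined above, so that the polynomial order is the illustrative M = 3 rather than the M = 9 of Figure 1.17), the Bayesian predictive mean and variance of (1.70)-(1.72) can be computed as follows.

# Bayesian predictive distribution (1.69)-(1.72).
# S^{-1} = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T = alpha*I + beta * Phi^T Phi
S_inv = alpha * np.eye(M + 1) + beta_true * (Phi.T @ Phi)
S = np.linalg.inv(S_inv)

# Mean (1.70): m(x) = beta * phi(x)^T S sum_n phi(x_n) t_n, evaluated at every new input.
m_x = beta_true * Phi_new @ S @ (Phi.T @ t)

# Variance (1.71): s^2(x) = 1/beta + phi(x)^T S phi(x), evaluated row-wise over Phi_new.
s2_x = 1.0 / beta_true + np.einsum('ij,jk,ik->i', Phi_new, S, Phi_new)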
We see that the variance, as well as the mean, of the predictive distribution in (1.69) is dependent on x. The first term in (1.71) represents the uncertainty in the predicted value of t due to the noise on the target variables and was expressed already in the maximum likelihood predictive distribution (1.64) through β_ML^{-1}. However, the second term arises from the uncertainty in the parameters w and is a consequence of the Bayesian treatment. The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17.
[Figure 1.17: The predictive distribution resulting from a Bayesian treatment of polynomial curve fitting using an M = 9 polynomial, with the fixed parameters α = 5 × 10^{-3} and β = 11.1 (corresponding to the known noise variance), in which the red curve denotes the mean of the predictive distribution and the red region corresponds to ±1 standard deviation around the mean.]

1.3. Model Selection

In our example of polynomial curve fitting using least squares, we saw that there was an optimal order of polynomial that gave the best generalization.
The order of the polynomial controls the number of free parameters in the model and thereby governs the model complexity. With regularized least squares, the regularization coefficient λ also controls the effective complexity of the model, whereas for more complex models, such as mixture distributions or neural networks, there may be multiple parameters governing complexity. In a practical application, we need to determine the values of such parameters, and the principal objective in doing so is usually to achieve the best predictive performance on new data. Furthermore, as well as finding the appropriate values for complexity parameters within a given model, we may wish to consider a range of different types of model in order to find the best one for our particular application.

We have already seen that, in the maximum likelihood approach, the performance on the training set is not a good indicator of predictive performance on unseen data due to the problem of over-fitting.
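One simple way to compare candidate complexities, sketched below in illustrative code that is not from the book, is to fit models of different polynomial order on part of the synthetic data generated earlier and compare their errors on the points held out from training. The even split and the range of candidate orders are arbitrary choices for illustration, assuming the arrays x, t, N, and the generator rng defined in the first sketch.

# Compare polynomial orders by their error on held-out data (illustrative sketch).
perm = rng.permutation(N)
x_tr, t_tr = x[perm[:N // 2]], t[perm[:N // 2]]
x_val, t_val = x[perm[N // 2:]], t[perm[N // 2:]]

def rms_error(w, x_pts, t_pts):
    # Root-mean-square error of the polynomial with coefficients w at the given points.
    pred = np.vander(x_pts, len(w), increasing=True) @ w
    return np.sqrt(np.mean((pred - t_pts) ** 2))

# Fit each candidate order on the training half and evaluate on the held-out half.
for order in range(0, 6):
    Phi_tr = np.vander(x_tr, order + 1, increasing=True)
    w_fit, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    print(f"M={order}: train RMS={rms_error(w_fit, x_tr, t_tr):.3f}, "
          f"held-out RMS={rms_error(w_fit, x_val, t_val):.3f}")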