Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 37

File No. 811375: Bishop C.M. Pattern Recognition and Machine Learning (2006) (Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf), page 37. 2020-08-25. StudIzba


Text from the file (page 37)

As we shall see, there is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance. The model with the optimal predictive capability is the one that leads to the best balance between bias and variance. This is illustrated by considering the sinusoidal data set from Chapter 1. Here we generate 100 data sets, each containing N = 25 data points, independently from the sinusoidal curve h(x) = sin(2πx).

The data sets are indexed by l = 1, ..., L, where L = 100, and for each data set $\mathcal{D}^{(l)}$ we fit a model with 24 Gaussian basis functions by minimizing the regularized error function (3.27) to give a prediction function $y^{(l)}(x)$, as shown in Figure 3.5.

Figure 3.5: Illustration of the dependence of bias and variance on model complexity, governed by a regularization parameter λ, using the sinusoidal data set from Chapter 1. There are L = 100 data sets, each having N = 25 data points, and there are 24 Gaussian basis functions in the model so that the total number of parameters is M = 25 including the bias parameter. The left column shows the result of fitting the model to the data sets for various values of ln λ (for clarity, only 20 of the 100 fits are shown). The right column shows the corresponding average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green). [Panel titles: ln λ = 2.6, ln λ = −0.31, ln λ = −2.4; axes: x (horizontal), t (vertical).]

Figure 3.6: Plot of squared bias and variance, together with their sum, corresponding to the results shown in Figure 3.5. Also shown is the average test set error for a test data set size of 1000 points. The minimum value of (bias)^2 + variance occurs around ln λ = −0.31, which is close to the value that gives the minimum error on the test data. [Curves: (bias)^2, variance, (bias)^2 + variance, test error; horizontal axis: ln λ from −3 to 2; vertical axis: 0 to 0.15.]

[Page headers: 3. LINEAR MODELS FOR REGRESSION; 3.2. The Bias-Variance Decomposition]
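The fitting step for a single data set can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the text does not give the basis-function centres, their width, the noise level, or how the inputs are drawn, so the evenly spaced centres, the width `s = 0.1`, the noise standard deviation `0.3`, and the uniform inputs are all assumed values, and the helper names `design_matrix` and `fit_regularized` are hypothetical. The closed-form solution used is the standard minimizer of a sum-of-squares error with a quadratic penalty on the weights.

```python
import numpy as np

def design_matrix(x, centres, s):
    """Column of ones (bias) plus one Gaussian basis function per centre."""
    gauss = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), gauss])

def fit_regularized(Phi, t, lam):
    """Minimize 0.5*||t - Phi w||^2 + 0.5*lam*||w||^2 in closed form."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
N = 25                                   # points per data set (from the text)
centres = np.linspace(0.0, 1.0, 24)      # assumed: evenly spaced centres
s = 0.1                                  # assumed: basis-function width
noise_std = 0.3                          # assumed: noise standard deviation

# One data set drawn around h(x) = sin(2*pi*x); uniform inputs are assumed.
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, N)

Phi = design_matrix(x, centres, s)       # shape (N, M) with M = 25
lam = np.exp(-0.31)                      # one of the ln(lambda) values in Figure 3.5
w = fit_regularized(Phi, t, lam)

# Prediction function y(x) evaluated on a grid, e.g. for plotting.
x_grid = np.linspace(0.0, 1.0, 200)
y_grid = design_matrix(x_grid, centres, s) @ w
```

Repeating the last few steps for L = 100 independently drawn data sets would give the family of curves the figure describes.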

The top row corresponds to a large value of the regularization coefficient λ that gives low variance (because the red curves in the left plot look similar) but high bias (because the two curves in the right plot are very different). Conversely on the bottom row, for which λ is small, there is large variance (shown by the high variability between the red curves in the left plot) but low bias (shown by the good fit between the average model fit and the original sinusoidal function). Note that the result of averaging many solutions for the complex model with M = 25 is a very good fit to the regression function, which suggests that averaging may be a beneficial procedure.

Indeed, a weighted averaging of multiple solutions lies at the heart of a Bayesian approach, although the averaging is with respect to the posterior distribution of parameters, not with respect to multiple data sets.

We can also examine the bias-variance trade-off quantitatively for this example. The average prediction is estimated from

$$\bar{y}(x) = \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x) \tag{3.45}$$

and the integrated squared bias and integrated variance are then given by

$$(\text{bias})^2 = \frac{1}{N} \sum_{n=1}^{N} \left\{ \bar{y}(x_n) - h(x_n) \right\}^2 \tag{3.46}$$

$$\text{variance} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \left\{ y^{(l)}(x_n) - \bar{y}(x_n) \right\}^2 \tag{3.47}$$

where the integral over x weighted by the distribution p(x) is approximated by a finite sum over data points drawn from that distribution. These quantities, along with their sum, are plotted as a function of ln λ in Figure 3.6.
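A rough numerical sketch of the estimates (3.45) to (3.47) is given below. It is not a reproduction of the book's experiment: the basis-function centres and width, the noise standard deviation, the uniform input distribution, and the use of an evenly spaced grid as the evaluation points $x_n$ are all assumptions, and `design_matrix` and `bias_variance` are hypothetical helper names. The numbers it prints will therefore differ from Figure 3.6, although the qualitative trend should be similar.

```python
import numpy as np

def design_matrix(x, centres, s):
    """Column of ones (bias) plus one Gaussian basis function per centre."""
    gauss = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), gauss])

def bias_variance(ln_lam, L=100, N=25, noise_std=0.3, seed=0):
    """Estimate (bias)^2 and variance over L simulated data sets."""
    rng = np.random.default_rng(seed)
    lam = np.exp(ln_lam)
    centres = np.linspace(0.0, 1.0, 24)      # assumed centres
    s = 0.1                                  # assumed width
    x_eval = np.linspace(0.0, 1.0, N)        # assumed evaluation points x_n
    h = np.sin(2.0 * np.pi * x_eval)         # true function h(x_n)
    Phi_eval = design_matrix(x_eval, centres, s)

    preds = np.empty((L, N))                 # preds[l, n] = y^(l)(x_n)
    for l in range(L):
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, N)
        Phi = design_matrix(x, centres, s)
        A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
        w = np.linalg.solve(A, Phi.T @ t)    # regularized least squares
        preds[l] = Phi_eval @ w

    y_bar = preds.mean(axis=0)               # average prediction, (3.45)
    bias_sq = np.mean((y_bar - h) ** 2)      # integrated squared bias, (3.46)
    variance = np.mean((preds - y_bar) ** 2) # integrated variance, (3.47)
    return bias_sq, variance

results = {ln_lam: bias_variance(ln_lam) for ln_lam in (2.6, -0.31, -2.4)}
for ln_lam, (b2, var) in results.items():
    print(f"ln lambda = {ln_lam:5.2f}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

Because the same seed is used for every value of ln λ, each setting is evaluated on the same simulated data sets, which makes the comparison across λ less noisy.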

We see that small values of λ allow the model to become finely tuned to the noise on each individual data set, leading to large variance. Conversely, a large value of λ pulls the weight parameters towards zero, leading to large bias.

Although the bias-variance decomposition may provide some interesting insights into the model complexity issue from a frequentist perspective, it is of limited practical value, because the bias-variance decomposition is based on averages with respect to ensembles of data sets, whereas in practice we have only the single observed data set.

If we had a large number of independent training sets of a given size, we would be better off combining them into a single large training set, which of course would reduce the level of over-fitting for a given model complexity.

Given these limitations, we turn in the next section to a Bayesian treatment of linear basis function models, which not only provides powerful insights into the issues of over-fitting but which also leads to practical techniques for addressing the question of model complexity.

3.3. Bayesian Linear Regression

In our discussion of maximum likelihood for setting the parameters of a linear regression model, we have seen that the effective model complexity, governed by the number of basis functions, needs to be controlled according to the size of the data set. Adding a regularization term to the log likelihood function means the effective model complexity can then be controlled by the value of the regularization coefficient, although the choice of the number and form of the basis functions is of course still important in determining the overall behaviour of the model.

This leaves the issue of deciding the appropriate model complexity for the particular problem, which cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, as discussed in Section 1.3, but this can be both computationally expensive and wasteful of valuable data. We therefore turn to a Bayesian treatment of linear regression, which will avoid the over-fitting problem of maximum likelihood, and which will also lead to automatic methods of determining model complexity using the training data alone. Again, for simplicity we will focus on the case of a single target variable t. Extension to multiple target variables is straightforward and follows the discussion of Section 3.1.5.

3.3.1 Parameter distribution

We begin our discussion of the Bayesian treatment of linear regression by introducing a prior probability distribution over the model parameters w.

For the moment, we shall treat the noise precision parameter β as a known constant. First note that the likelihood function p(t|w) defined by (3.10) is the exponential of a quadratic function of w. The corresponding conjugate prior is therefore given by a Gaussian distribution of the form

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \tag{3.48}$$

having mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$.

[Margin note: Exercise 3.7]

Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian.

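We can evaluate this distribution by the usual procedure of completing the square in the exponential, and then finding the normalization coefficient using the standard result for a normalized Gaussian. However, we have already done the necessary work in deriving the general result (2.116), which allows us to write down the posterior distribution directly in the form

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \tag{3.49}$$

where

$$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \right) \tag{3.50}$$

$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}. \tag{3.51}$$

[Margin note: Exercise 3.8]

Note that because the posterior distribution is Gaussian, its mode coincides with its mean.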
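Equations (3.50) and (3.51) translate almost line for line into NumPy. The sketch below is illustrative only: the function name `posterior`, the straight-line basis φ(x) = (1, x), the synthetic data, and the numerical values of β, the prior mean, and the prior covariance are all assumptions chosen to make the example runnable, not values taken from this passage.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Return (m_N, S_N) of the Gaussian posterior over w, eqs. (3.50)-(3.51)."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi           # (3.51)
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)     # (3.50)
    return mN, SN

rng = np.random.default_rng(1)
beta = 25.0                                        # assumed known noise precision
x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 1.0 / np.sqrt(beta), 20)  # assumed true line

Phi = np.column_stack([np.ones_like(x), x])        # assumed basis: phi(x) = (1, x)
m0 = np.zeros(2)                                   # assumed prior mean
S0 = np.eye(2) / 2.0                               # assumed prior covariance

mN, SN = posterior(Phi, t, m0, S0, beta)
print("posterior mean:", mN)
print("posterior covariance:\n", SN)
```

Explicit matrix inverses are used here to mirror the equations; for larger or ill-conditioned problems a Cholesky-based solve would usually be a better choice.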

Thus the maximum posterior weight vector is simply given by $\mathbf{w}_{\text{MAP}} = \mathbf{m}_N$. If we consider an infinitely broad prior $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ with α → 0, the mean $\mathbf{m}_N$ of the posterior distribution reduces to the maximum likelihood value $\mathbf{w}_{\text{ML}}$ given by (3.15). Similarly, if N = 0, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, such that the new posterior distribution is again given by (3.49).

For the remainder of this chapter, we shall consider a particular form of Gaussian prior in order to simplify the treatment.
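The three limiting cases in this paragraph can be checked numerically. The sketch below makes the same kind of assumptions as any toy example: the function name `posterior`, the straight-line basis, the synthetic data, and the values of β and α are illustrative choices, and a very small α stands in for the limit α → 0. The maximum likelihood solution is computed with an ordinary least-squares solver.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Return (m_N, S_N) of the Gaussian posterior over w, eqs. (3.50)-(3.51)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

rng = np.random.default_rng(2)
beta = 25.0                                        # assumed noise precision
x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 1.0 / np.sqrt(beta), 20)
Phi = np.column_stack([np.ones_like(x), x])        # assumed basis: phi(x) = (1, x)
m0, S0 = np.zeros(2), np.eye(2) / 2.0              # assumed prior

# 1. Batch posterior from all N points at once.
m_batch, S_batch = posterior(Phi, t, m0, S0, beta)

# 2. Sequential updates: each posterior becomes the prior for the next point.
m_seq, S_seq = m0, S0
for n in range(len(t)):
    m_seq, S_seq = posterior(Phi[n:n + 1], t[n:n + 1], m_seq, S_seq, beta)

# 3. N = 0: with no data the posterior equals the prior.
m_empty, S_empty = posterior(np.empty((0, 2)), np.empty(0), m0, S0, beta)

# 4. Very broad prior S0 = I / alpha: the mean approaches the ML solution.
alpha = 1e-8
m_broad, _ = posterior(Phi, t, np.zeros(2), np.eye(2) / alpha, beta)
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]

print("batch vs sequential mean:", m_batch, m_seq)
print("broad-prior mean vs ML:  ", m_broad, w_ml)
```

The sequential and batch results agree up to floating-point error because each update adds one data point's contribution to the posterior precision and to the precision-weighted mean.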

