Bishop C.M., Pattern Recognition and Machine Learning (2006)
Figure 3.4 Plot of the contours of the unregularized error function (blue) along with the constraint region (3.30) for the quadratic regularizer q = 2 on the left and the lasso regularizer q = 1 on the right, in which the optimum value for the parameter vector w is denoted by $\mathbf{w}^{\star}$. The lasso gives a sparse solution in which $w_1^{\star} = 0$. (Both panels plot w2 against w1.)

For the remainder of this chapter we shall focus on the quadratic regularizer (3.27), both for its practical importance and its analytical tractability.
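As a brief, hypothetical illustration of the contrast shown in Figure 3.4 (the quadratic regularizer merely shrinks the weights, whereas the lasso drives some of them exactly to zero), the following sketch fits both penalties to synthetic data. It assumes scikit-learn is available; note that scikit-learn parameterizes the regularization strength differently from the λ used in this chapter, and that the data set and penalty strengths are invented purely for illustration.

```python
# Sketch: quadratic (q = 2) versus lasso (q = 1) regularization on synthetic data.
# Assumes scikit-learn is installed; the data-generating process is invented for illustration.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
N, M = 50, 10
X = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[:3] = [1.5, -2.0, 0.7]           # only the first three features matter
t = X @ w_true + 0.1 * rng.normal(size=N)

ridge = Ridge(alpha=1.0).fit(X, t)      # quadratic penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, t)      # q = 1 penalty: drives some weights exactly to zero

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros (ridge, lasso):",
      np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))
```

Typically the lasso sets the coefficients of the irrelevant features exactly to zero, mirroring the sparse solution in the figure, whereas the quadratic regularizer only shrinks them.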
3.1.5 Multiple outputs

So far, we have considered the case of a single target variable t. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector t. This could be done by introducing a different set of basis functions for each component of t, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector, so that

$$ \mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{3.31} $$

where y is a K-dimensional column vector, W is an M × K matrix of parameters, and $\boldsymbol{\phi}(\mathbf{x})$ is an M-dimensional column vector with elements $\phi_j(\mathbf{x})$, with $\phi_0(\mathbf{x}) = 1$ as before.
Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form

$$ p(\mathbf{t}|\mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}\bigl(\mathbf{t}|\mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I}\bigr). \tag{3.32} $$

If we have a set of observations $\mathbf{t}_1, \ldots, \mathbf{t}_N$, we can combine these into a matrix T of size N × K such that the nth row is given by $\mathbf{t}_n^{\mathrm{T}}$. Similarly, we can combine the input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ into a matrix X. The log likelihood function is then given by

$$ \ln p(\mathbf{T}|\mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\bigl(\mathbf{t}_n|\mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\mathbf{I}\bigr) = \frac{NK}{2} \ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2} \sum_{n=1}^{N} \bigl\| \mathbf{t}_n - \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \bigr\|^2. \tag{3.33} $$

As before, we can maximize this function with respect to W (Exercise 3.6), giving

$$ \mathbf{W}_{\mathrm{ML}} = \bigl(\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\bigr)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{T}. \tag{3.34} $$

If we examine this result for each target variable $t_k$, we have

$$ \mathbf{w}_k = \bigl(\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\bigr)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}_k = \boldsymbol{\Phi}^{\dagger} \mathbf{t}_k \tag{3.35} $$

where $\mathbf{t}_k$ is an N-dimensional column vector with components $t_{nk}$ for n = 1, . . . , N. Thus the solution to the regression problem decouples between the different target variables, and we need only compute a single pseudo-inverse matrix $\boldsymbol{\Phi}^{\dagger}$, which is shared by all of the vectors $\mathbf{w}_k$.

The extension to general Gaussian noise distributions having arbitrary covariance matrices is straightforward. Again, this leads to a decoupling into K independent regression problems. This result is unsurprising, because the parameters W define only the mean of the Gaussian noise distribution, and we know from Section 2.3.4 that the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
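The following sketch illustrates (3.34) and (3.35) under an assumed setup (a polynomial basis and a synthetic two-output data set, neither taken from the book): it builds the design matrix Φ, computes W_ML with the pseudo-inverse, and checks that each column w_k can equally well be obtained by regressing the corresponding target column on the same shared Φ†.

```python
# Sketch of the multiple-output least-squares solution (3.34)-(3.35).
# The basis functions and data-generating process are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
N, K = 100, 2                               # N observations, K target variables
x = rng.uniform(-1.0, 1.0, size=N)

def phi(x, M=4):
    """Polynomial basis with phi_0(x) = 1, giving an M-dimensional feature vector."""
    return np.vander(x, M, increasing=True)  # columns: 1, x, x^2, x^3

Phi = phi(x)                                # design matrix, shape (N, M)
W_true = rng.normal(size=(Phi.shape[1], K))
T = Phi @ W_true + 0.05 * rng.normal(size=(N, K))   # target matrix, shape (N, K)

Phi_pinv = np.linalg.pinv(Phi)              # the single shared pseudo-inverse Phi^dagger
W_ml = Phi_pinv @ T                         # equation (3.34)

# Decoupling (3.35): solving each target column on its own gives the same columns w_k.
for k in range(K):
    assert np.allclose(Phi_pinv @ T[:, k], W_ml[:, k])

print("W_ML:\n", np.round(W_ml, 3))
```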
From now on, we shall therefore consider a single target variable t for simplicity.

3.2. The Bias-Variance Decomposition

So far in our discussion of linear models for regression, we have assumed that the form and number of basis functions are both fixed. As we have seen in Chapter 1, the use of maximum likelihood, or equivalently least squares, can lead to severe over-fitting if complex models are trained using data sets of limited size. However, limiting the number of basis functions in order to avoid over-fitting has the side effect of limiting the flexibility of the model to capture interesting and important trends in the data. Although the introduction of regularization terms can control over-fitting for models with many parameters, this raises the question of how to determine a suitable value for the regularization coefficient λ.
Seeking the solution that minimizes the regularized error function with respect to both the weight vector w and the regularization coefficient λ is clearly not the right approach, since this leads to the unregularized solution with λ = 0.

As we have seen in earlier chapters, the phenomenon of over-fitting is really an unfortunate property of maximum likelihood and does not arise when we marginalize over parameters in a Bayesian setting.
In this chapter, we shall consider the Bayesian view of model complexity in some depth. Before doing so, however, it is instructive to consider a frequentist viewpoint of the model complexity issue, known as the bias-variance trade-off. Although we shall introduce this concept in the context of linear basis function models, where it is easy to illustrate the ideas using simple examples, the discussion has more general applicability.

In Section 1.5.5, when we discussed decision theory for regression problems, we considered various loss functions, each of which leads to a corresponding optimal prediction once we are given the conditional distribution p(t|x).
A popular choice is the squared loss function, for which the optimal prediction is given by the conditional expectation, which we denote by h(x) and which is given by

$$ h(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}] = \int t\, p(t|\mathbf{x})\, \mathrm{d}t. \tag{3.36} $$

At this point, it is worth distinguishing between the squared loss function arising from decision theory and the sum-of-squares error function that arose in the maximum likelihood estimation of model parameters.
We might use more sophisticated techniques than least squares, for example regularization or a fully Bayesian approach, to determine the conditional distribution p(t|x). These can all be combined with the squared loss function for the purpose of making predictions.

We showed in Section 1.5.5 that the expected squared loss can be written in the form

$$ \mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x})\, \mathrm{d}\mathbf{x} + \int\!\!\int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t. \tag{3.37} $$

Recall that the second term, which is independent of y(x), arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss. The first term depends on our choice for the function y(x), and we will seek a solution for y(x) which makes this term a minimum.
Because it is nonnegative, the smallest that we can hope to make this term is zero. If we had an unlimited supply of data (and unlimited computational resources), we could in principle find the regression function h(x) to any desired degree of accuracy, and this would represent the optimal choice for y(x). However, in practice we have a data set D containing only a finite number N of data points, and consequently we do not know the regression function h(x) exactly.
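As a small numerical check of the claim that the conditional mean minimizes the expected squared loss, the following sketch assumes a simple data-generating process, t = sin(2πx) + Gaussian noise, so that h(x) = sin(2πx) is known exactly, and compares a Monte Carlo estimate of E[L] for h(x) against a deliberately offset predictor; the excess loss of the offset predictor matches the first term of (3.37). The setup is invented purely for illustration.

```python
# Sketch: the conditional mean h(x) = E[t|x] minimizes the expected squared loss.
# Assumed data-generating process: t = sin(2*pi*x) + Gaussian noise.
import numpy as np

rng = np.random.default_rng(2)
S = 200_000                                    # Monte Carlo samples of (x, t)
x = rng.uniform(0.0, 1.0, size=S)
noise_std = 0.3
t = np.sin(2 * np.pi * x) + noise_std * rng.normal(size=S)

h = lambda x: np.sin(2 * np.pi * x)            # the true regression function
y_off = lambda x: np.sin(2 * np.pi * x) + 0.2  # a deliberately offset predictor

loss_h = np.mean((h(x) - t) ** 2)              # approximately the irreducible noise term
loss_off = np.mean((y_off(x) - t) ** 2)        # noise term plus the squared offset

print(f"E[L] using h(x):       {loss_h:.4f}  (noise variance = {noise_std**2:.4f})")
print(f"E[L] using h(x) + 0.2: {loss_off:.4f} (noise + 0.2^2 = {noise_std**2 + 0.04:.4f})")
```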
If we model h(x) using a parametric function y(x, w) governed by a parameter vector w, then from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over w. A frequentist treatment, however, involves making a point estimate of w based on the data set D, and tries instead to interpret the uncertainty of this estimate through the following thought experiment.
Suppose we had a large number of data sets each of size N and each drawn independently from the distribution p(t, x). For any given data set D, we can run our learning algorithm and obtain a prediction function y(x; D). Different data sets from the ensemble will give different functions and consequently different values of the squared loss. The performance of a particular learning algorithm is then assessed by taking the average over this ensemble of data sets.

Consider the integrand of the first term in (3.37), which for a particular data set D takes the form

$$ \{y(\mathbf{x}; D) - h(\mathbf{x})\}^2. \tag{3.38} $$

Because this quantity will be dependent on the particular data set D, we take its average over the ensemble of data sets.
If we add and subtract the quantity $\mathbb{E}_D[y(\mathbf{x}; D)]$ inside the braces, and then expand, we obtain

$$ \{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)] + \mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 = \{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2 + \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 + 2\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}. \tag{3.39} $$

We now take the expectation of this expression with respect to D and note that the final term will vanish, giving

$$ \mathbb{E}_D\bigl[\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2\bigr] = \underbrace{\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\bigl[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\bigr]}_{\text{variance}}. \tag{3.40} $$
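To make the step from (3.39) to (3.40) explicit, note that the factor $\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})$ does not depend on D, so under the expectation the cross term factorizes and vanishes:

$$ \mathbb{E}_D\bigl[2\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}\bigr] = 2\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}\bigl(\mathbb{E}_D[y(\mathbf{x}; D)] - \mathbb{E}_D[y(\mathbf{x}; D)]\bigr) = 0. $$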
We see that the expected squared difference between y(x; D) and the regression function h(x) can be expressed as the sum of two terms. The first term, called the squared bias, represents the extent to which the average prediction over all data sets differs from the desired regression function. The second term, called the variance, measures the extent to which the solutions for individual data sets vary around their average, and hence this measures the extent to which the function y(x; D) is sensitive to the particular choice of data set. We shall provide some intuition to support these definitions shortly when we consider a simple example.

So far, we have considered a single input value x. If we substitute this expansion back into (3.37), we obtain the following decomposition of the expected squared loss

$$ \text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise} \tag{3.41} $$

where

$$ (\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 p(\mathbf{x})\, \mathrm{d}\mathbf{x} \tag{3.42} $$

$$ \text{variance} = \int \mathbb{E}_D\bigl[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\bigr] p(\mathbf{x})\, \mathrm{d}\mathbf{x} \tag{3.43} $$

$$ \text{noise} = \int\!\!\int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t \tag{3.44} $$

and the bias and variance terms now refer to integrated quantities.

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term.
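The decomposition (3.41)-(3.44) can be checked numerically. The sketch below is an illustrative setup, not the book's own experiment: it assumes a sinusoidal data-generating process, Gaussian basis functions, and a fixed regularization coefficient, draws an ensemble of data sets, fits a regularized least-squares model to each, and estimates the integrated squared bias, variance, and noise terms on a test grid.

```python
# Sketch: empirical check of the decomposition (3.41)-(3.44).
# The sinusoidal data-generating process, Gaussian basis functions, and the value of
# the regularization coefficient are assumptions made purely for illustration.
import numpy as np

rng = np.random.default_rng(3)
h = lambda x: np.sin(2 * np.pi * x)           # true regression function h(x)
noise_std = 0.3                               # so the noise term (3.44) equals noise_std**2
N, L, lam = 25, 500, 1e-3                     # points per data set, number of data sets, lambda

centres = np.linspace(0, 1, 24)
def design(x, s=0.1):
    """Gaussian basis functions plus a bias term phi_0(x) = 1."""
    Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

x_test = np.linspace(0, 1, 200)               # grid approximating the integral over x
predictions = np.empty((L, len(x_test)))
expected_loss = 0.0

for l in range(L):                            # the ensemble of data sets D
    x = rng.uniform(0, 1, size=N)
    t = h(x) + noise_std * rng.normal(size=N)
    Phi = design(x)
    # Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    predictions[l] = design(x_test) @ w       # y(x; D) evaluated on the test grid
    t_test = h(x_test) + noise_std * rng.normal(size=len(x_test))
    expected_loss += np.mean((predictions[l] - t_test) ** 2) / L

y_bar = predictions.mean(axis=0)              # E_D[y(x; D)]
bias2 = np.mean((y_bar - h(x_test)) ** 2)                 # (3.42)
variance = np.mean(predictions.var(axis=0))               # (3.43)
noise = noise_std ** 2                                    # (3.44), known here by construction

print(f"(bias)^2 + variance + noise = {bias2 + variance + noise:.4f}")
print(f"directly estimated expected loss = {expected_loss:.4f}")
```

Increasing λ in this sketch reduces the variance term at the cost of a larger squared bias, which is precisely the trade-off between model flexibility and sensitivity to the data set discussed above.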