Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 35

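Thus $\mathbf{x}$ will always appear in the set of conditioning variables, and so from now on we will drop the explicit $\mathbf{x}$ from expressions such as $p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)$ in order to keep the notation uncluttered. Taking the logarithm of the likelihood function, and making use of the standard form (1.46) for the univariate Gaussian, we have

$$\ln p(\mathbf{t}|\mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n | \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w}) \tag{3.11}$$

where the sum-of-squares error function is defined by

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\}^2. \tag{3.12}$$

Having written down the likelihood function, we can use maximum likelihood to determine $\mathbf{w}$ and $\beta$. Consider first the maximization with respect to $\mathbf{w}$.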
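As a quick numerical check of (3.11), the following Python sketch evaluates the log likelihood twice: once from the closed form and once as a direct sum of Gaussian log densities. The toy data, the cubic polynomial basis, the value of `beta`, and all function names are assumptions made for illustration; none of them come from the text.

```python
import numpy as np

def sum_of_squares_error(w, Phi, t):
    """E_D(w) from (3.12): half the sum of squared residuals."""
    residuals = t - Phi @ w
    return 0.5 * np.sum(residuals ** 2)

def log_likelihood(w, beta, Phi, t):
    """Closed form (3.11): N/2 ln(beta) - N/2 ln(2 pi) - beta E_D(w)."""
    N = len(t)
    return (0.5 * N * np.log(beta)
            - 0.5 * N * np.log(2.0 * np.pi)
            - beta * sum_of_squares_error(w, Phi, t))

def log_likelihood_direct(w, beta, Phi, t):
    """Sum over n of ln N(t_n | w^T phi(x_n), 1/beta), term by term."""
    mean = Phi @ w
    log_density = 0.5 * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * (t - mean) ** 2
    return np.sum(log_density)

# Toy data (assumed): a noisy sine curve and a cubic polynomial basis.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, 4, increasing=True)   # columns: 1, x, x^2, x^3
w = rng.normal(size=4)                   # an arbitrary weight vector
beta = 25.0                              # an arbitrary noise precision

closed_form = log_likelihood(w, beta, Phi, t)
direct_sum = log_likelihood_direct(w, beta, Phi, t)
print(closed_form, direct_sum)
```

The two printed values should agree up to floating-point rounding. For a fixed $\beta$ the only term that depends on $\mathbf{w}$ is $-\beta E_D(\mathbf{w})$, so maximizing the likelihood over $\mathbf{w}$ is the same as minimizing the sum-of-squares error.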

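As observed already in Section 1.2.5, we see that maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing a sum-of-squares error function given by $E_D(\mathbf{w})$. The gradient of the log likelihood function (3.11) takes the form

$$\nabla \ln p(\mathbf{t}|\mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\} \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}. \tag{3.13}$$

Setting this gradient to zero (the positive factor $\beta$ cancels) gives

$$0 = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} - \mathbf{w}^{\mathrm{T}} \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} \right). \tag{3.14}$$

Solving for $\mathbf{w}$ we obtain

$$\mathbf{w}_{\mathrm{ML}} = \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \tag{3.15}$$

which are known as the normal equations for the least squares problem. Here $\boldsymbol{\Phi}$ is an $N \times M$ matrix, called the design matrix, whose elements are given by $\Phi_{nj} = \phi_j(\mathbf{x}_n)$, so that

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}. \tag{3.16}$$

The quantity

$$\boldsymbol{\Phi}^{\dagger} \equiv \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \tag{3.17}$$

is known as the Moore-Penrose pseudo-inverse of the matrix $\boldsymbol{\Phi}$ (Rao and Mitra, 1971; Golub and Van Loan, 1996).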
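A minimal NumPy sketch of (3.15) to (3.17) follows. It assumes a polynomial basis $\phi_j(x) = x^j$ and synthetic data; these choices, and the function names `design_matrix`, `fit_normal_equations` and `fit_pseudo_inverse`, are illustrative only. The first solver works directly from the normal equations, the second uses `np.linalg.pinv`, which computes the pseudo-inverse through a singular value decomposition.

```python
import numpy as np

def design_matrix(x, M):
    """N x M design matrix (3.16) for the assumed basis phi_j(x) = x**j."""
    return np.vander(x, M, increasing=True)

def fit_normal_equations(Phi, t):
    """Solve (Phi^T Phi) w = Phi^T t, i.e. (3.15), without an explicit inverse."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def fit_pseudo_inverse(Phi, t):
    """w_ML = pinv(Phi) t, with the Moore-Penrose pseudo-inverse of (3.17)."""
    return np.linalg.pinv(Phi) @ t

# Synthetic data (assumed): quadratic trend plus Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
true_w = np.array([0.5, -1.0, 2.0])
Phi = design_matrix(x, M=3)
t = Phi @ true_w + rng.normal(scale=0.1, size=x.shape)

w_ne = fit_normal_equations(Phi, t)
w_pinv = fit_pseudo_inverse(Phi, t)
print(w_ne)
print(w_pinv)

# At the solution the gradient (3.13) vanishes: Phi^T (t - Phi w) = 0.
gradient = Phi.T @ (t - Phi @ w_ne)

# For a square, invertible matrix the pseudo-inverse is the ordinary inverse.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
pinv_equals_inv = np.allclose(np.linalg.pinv(A), np.linalg.inv(A))
```

On well-conditioned problems such as this one the two solvers agree closely. The pseudo-inverse route is usually the more robust choice when the columns of the design matrix are nearly linearly dependent.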

It can be regarded as a generalization of the notion of matrix inverse to nonsquare matrices. Indeed, if $\boldsymbol{\Phi}$ is square and invertible, then using the property $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ we see that $\boldsymbol{\Phi}^{\dagger} \equiv \boldsymbol{\Phi}^{-1}$.

At this point, we can gain some insight into the role of the bias parameter $w_0$. If we make the bias parameter explicit, then the error function (3.12) becomes

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \Big\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n) \Big\}^2. \tag{3.18}$$

Setting the derivative with respect to $w_0$ equal to zero, and solving for $w_0$, we obtain

$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j \tag{3.19}$$

where we have defined

$$\bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(\mathbf{x}_n). \tag{3.20}$$

Thus the bias $w_0$ compensates for the difference between the averages (over the training set) of the target values and the weighted sum of the averages of the basis function values.

We can also maximize the log likelihood function (3.11) with respect to the noise precision parameter $\beta$, giving

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{t_n - \mathbf{w}_{\mathrm{ML}}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\}^2 \tag{3.21}$$

and so we see that the inverse of the noise precision is given by the residual variance of the target values around the regression function.

Figure 3.2: Geometrical interpretation of the least-squares solution, in an $N$-dimensional space whose axes are the values of $t_1, \ldots, t_N$. The least-squares regression function is obtained by finding the orthogonal projection of the data vector $\mathbf{t}$ onto the subspace spanned by the basis functions $\phi_j(\mathbf{x})$, in which each basis function is viewed as a vector $\boldsymbol{\varphi}_j$ of length $N$ with elements $\phi_j(\mathbf{x}_n)$. (The figure labels the subspace $S$ and the vectors $\boldsymbol{\varphi}_1$, $\boldsymbol{\varphi}_2$, $\mathbf{t}$ and $\mathbf{y}$.)

3.1.2 Geometry of least squares

At this point, it is instructive to consider the geometrical interpretation of the least-squares solution (Exercise 3.2).

To do this we consider an $N$-dimensional space whose axes are given by the $t_n$, so that $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$ is a vector in this space. Each basis function $\phi_j(\mathbf{x}_n)$, evaluated at the $N$ data points, can also be represented as a vector in the same space, denoted by $\boldsymbol{\varphi}_j$, as illustrated in Figure 3.2. Note that $\boldsymbol{\varphi}_j$ corresponds to the $j$th column of $\boldsymbol{\Phi}$, whereas $\boldsymbol{\phi}(\mathbf{x}_n)$ corresponds to the $n$th row of $\boldsymbol{\Phi}$. If the number $M$ of basis functions is smaller than the number $N$ of data points, then the $M$ vectors $\boldsymbol{\varphi}_j$ will span a linear subspace $S$ of dimensionality $M$.

We define $\mathbf{y}$ to be an $N$-dimensional vector whose $n$th element is given by $y(\mathbf{x}_n, \mathbf{w})$, where $n = 1, \ldots, N$. Because $\mathbf{y}$ is an arbitrary linear combination of the vectors $\boldsymbol{\varphi}_j$, it can live anywhere in the $M$-dimensional subspace. The sum-of-squares error (3.12) is then equal (up to a factor of 1/2) to the squared Euclidean distance between $\mathbf{y}$ and $\mathbf{t}$.

Thus the least-squares solution for $\mathbf{w}$ corresponds to that choice of $\mathbf{y}$ that lies in subspace $S$ and that is closest to $\mathbf{t}$. Intuitively, from Figure 3.2, we anticipate that this solution corresponds to the orthogonal projection of $\mathbf{t}$ onto the subspace $S$. This is indeed the case, as can easily be verified by noting that the solution for $\mathbf{y}$ is given by $\boldsymbol{\Phi}\mathbf{w}_{\mathrm{ML}}$, and then confirming that this takes the form of an orthogonal projection.

In practice, a direct solution of the normal equations can lead to numerical difficulties when $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ is close to singular. In particular, when two or more of the basis vectors $\boldsymbol{\varphi}_j$ are co-linear, or nearly so, the resulting parameter values can have large magnitudes. Such near degeneracies will not be uncommon when dealing with real data sets. The resulting numerical difficulties can be addressed using the technique of singular value decomposition, or SVD (Press et al., 1992; Bishop and Nabney, 2008).
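The projection picture, together with the earlier results (3.19) and (3.21), can be checked numerically. The sketch below is only an illustration under assumed conditions: the data are synthetic, the basis is polynomial with $\phi_0 = 1$, and the variable names are hypothetical. It uses `np.linalg.lstsq`, which solves the least-squares problem through an SVD rather than by forming $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$.

```python
import numpy as np

# Synthetic data (assumed) and a polynomial basis whose first column is phi_0 = 1.
rng = np.random.default_rng(2)
N, M = 30, 4
x = rng.uniform(0.0, 1.0, size=N)
Phi = np.vander(x, M, increasing=True)
t = np.cos(3.0 * x) + rng.normal(scale=0.1, size=N)

# SVD-based least squares; avoids forming Phi^T Phi explicitly.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w_ml                       # fitted vector, lies in the subspace S
residual = t - y

# Orthogonal projector onto S, the column space of Phi.
P = Phi @ np.linalg.pinv(Phi)
projected_t = P @ t

# The residual is orthogonal to every basis vector (column of Phi).
inner_products = Phi.T @ residual

# Bias from (3.19): w_0 = mean(t) - sum_j w_j * mean(phi_j), for j >= 1.
t_bar = t.mean()
phi_bar = Phi[:, 1:].mean(axis=0)
w0_from_means = t_bar - w_ml[1:] @ phi_bar

# Noise precision from (3.21): 1 / beta_ML is the mean squared residual.
beta_ml = 1.0 / np.mean(residual ** 2)

print(np.max(np.abs(projected_t - y)), np.max(np.abs(inner_products)))
print(w_ml[0], w0_from_means, beta_ml)
```

If the fit is an orthogonal projection, `projected_t` coincides with `y` and the inner products of the residual with the columns of `Phi` vanish up to rounding. The bias computed from the averages should match the first component of `w_ml`.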

Note that the addition of a regularization term ensures that the matrix is nonsingular, even in the presence of degeneracies.

3.1.3 Sequential learning

Batch techniques, such as the maximum likelihood solution (3.15), which involve processing the entire training set in one go, can be computationally costly for large data sets. As we have discussed in Chapter 1, if the data set is sufficiently large, it may be worthwhile to use sequential algorithms, also known as on-line algorithms, in which the data points are considered one at a time, and the model parameters updated after each such presentation.

Sequential learning is also appropriate for real-time applications in which the data observations are arriving in a continuous stream, and predictions must be made before all of the data points are seen.

We can obtain a sequential learning algorithm by applying the technique of stochastic gradient descent, also known as sequential gradient descent, as follows. If the error function comprises a sum over data points $E = \sum_n E_n$, then after presentation of pattern $n$, the stochastic gradient descent algorithm updates the parameter vector $\mathbf{w}$ using

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n \tag{3.22}$$

where $\tau$ denotes the iteration number, and $\eta$ is a learning rate parameter. We shall discuss the choice of value for $\eta$ shortly.

The value of $\mathbf{w}$ is initialized to some starting vector $\mathbf{w}^{(0)}$. For the case of the sum-of-squares error function (3.12), this gives

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)\mathrm{T}} \boldsymbol{\phi}_n \right) \boldsymbol{\phi}_n \tag{3.23}$$

where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$. This is known as least-mean-squares or the LMS algorithm. The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges (Bishop and Nabney, 2008).

3.1.4 Regularized least squares

In Section 1.1, we introduced the idea of adding a regularization term to an error function in order to control over-fitting, so that the total error function to be minimized takes the form

$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \tag{3.24}$$

where $\lambda$ is the regularization coefficient that controls the relative importance of the data-dependent error $E_D(\mathbf{w})$ and the regularization term $E_W(\mathbf{w})$. One of the simplest forms of regularizer is given by the sum-of-squares of the weight vector elements

$$E_W(\mathbf{w}) = \frac{1}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}. \tag{3.25}$$

If we also consider the sum-of-squares error function given by

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\}^2 \tag{3.26}$$

then the total error function becomes

$$\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}. \tag{3.27}$$

This particular choice of regularizer is known in the machine learning literature as weight decay because in sequential learning algorithms, it encourages weight values to decay towards zero, unless supported by the data.
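The LMS update (3.23) and the weight-decay effect can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function name `lms_fit`, the zero initialization, the fixed learning rate, the random presentation order, and the choice to apply the full decay term `weight_decay * w` at every pattern presentation are all assumptions. The targets are generated without noise so that a constant learning rate converges to the exact solution.

```python
import numpy as np

def lms_fit(Phi, t, eta=0.1, weight_decay=0.0, epochs=200, seed=0):
    """Sequential (stochastic) gradient descent for the sum-of-squares error.

    With weight_decay = 0 each step is the LMS update (3.23). A positive
    weight_decay adds the gradient of a quadratic regularizer at every step
    (how the regularizer is shared across patterns is an assumption here).
    """
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                           # w^(0): assumed zero start
    for _ in range(epochs):
        for n in rng.permutation(N):          # one pattern at a time
            phi_n = Phi[n]
            error = t[n] - w @ phi_n          # t_n - w^T phi_n
            w = w + eta * (error * phi_n - weight_decay * w)
    return w

# Noise-free toy problem (assumed): a straight line with basis (1, x).
rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, size=100)
Phi = np.column_stack([np.ones_like(x), x])
true_w = np.array([1.0, 2.0])
t = Phi @ true_w

w_lms = lms_fit(Phi, t)
w_decay = lms_fit(Phi, t, weight_decay=0.5)
print(w_lms, w_decay)
```

With `weight_decay=0.0` the weights should approach `true_w`. With a positive value each update pulls the weights back towards zero, so the final weight vector has a smaller norm. If `eta` is too large relative to the squared norm of the basis vectors, the iteration can diverge, which is why the learning rate has to be chosen with care.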

In statistics, this regularizer provides an example of a parameter shrinkage method because it shrinks parameter values towards zero. It has the advantage that the error function remains a quadratic function of $\mathbf{w}$, and so its exact minimizer can be found in closed form. Specifically, setting the gradient of (3.27) with respect to $\mathbf{w}$ to zero, and solving for $\mathbf{w}$ as before, we obtain

$$\mathbf{w} = \left( \lambda \mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}. \tag{3.28}$$

This represents a simple extension of the least-squares solution (3.15).

Figure 3.3: Contours of the regularization term in (3.29) for various values of the parameter $q$ (panels: $q = 0.5$, $q = 1$, $q = 2$, $q = 4$).

A more general regularizer is sometimes used, for which the regularized error takes the form

$$\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q \tag{3.29}$$

where $q = 2$ corresponds to the quadratic regularizer (3.27) (Exercise 3.5).
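The closed-form solution (3.28) for the quadratic regularizer takes only a few lines of NumPy. The sketch below is illustrative: the function name `fit_regularized`, the degree-5 polynomial basis, the noisy sine data and the particular values of $\lambda$ are assumptions. It also shows that a duplicated basis vector, which makes $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ singular, causes no difficulty once $\lambda > 0$.

```python
import numpy as np

def fit_regularized(Phi, t, lam):
    """Minimizer of (3.27): solve (lam * I + Phi^T Phi) w = Phi^T t, i.e. (3.28)."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Toy data (assumed): noisy sine curve, polynomial basis up to x^5.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 15)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, 6, increasing=True)

# Larger lambda shrinks the weight vector towards zero.
lambdas = [1e-3, 1e-1, 10.0]
weights = [fit_regularized(Phi, t, lam) for lam in lambdas]
norms = [np.linalg.norm(w) for w in weights]
print(norms)

# Gradient of (3.27) at each solution: lam * w - Phi^T (t - Phi w).
gradients = [lam * w - Phi.T @ (t - Phi @ w) for lam, w in zip(lambdas, weights)]

# Degenerate design: repeat one column so that Phi^T Phi is singular.
Phi_dup = np.column_stack([Phi, Phi[:, 1]])
w_dup = fit_regularized(Phi_dup, t, 0.1)
print(w_dup)
```

The printed norms decrease as $\lambda$ grows, which is the shrinkage effect described above. In the degenerate case the two identical columns receive equal weights, because the regularized problem has a unique minimizer.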

Figure 3.3 shows contours of the regularization function for different values of $q$. The case of $q = 1$ is known as the lasso in the statistics literature (Tibshirani, 1996). It has the property that if $\lambda$ is sufficiently large, some of the coefficients $w_j$ are driven to zero, leading to a sparse model in which the corresponding basis functions play no role.

To see this, we first note that minimizing (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint

$$\sum_{j=1}^{M} |w_j|^q \leqslant \eta \tag{3.30}$$

for an appropriate value of the parameter $\eta$, where the two approaches can be related using Lagrange multipliers (Appendix E). The origin of the sparsity can be seen from Figure 3.4, which shows the minimum of the error function subject to the constraint (3.30). As $\lambda$ is increased, an increasing number of parameters are driven to zero.

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity. However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient $\lambda$. We shall return to the issue of model complexity later in this chapter.
