Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), страница 49

Файл №811375 Bishop C.M. Pattern Recognition and Machine Learning (2006) (Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf) 49 страницаBishop C.M. Pattern Recognition and Machine Learning (2006) (811375) страница 492020-08-252020-08-25СтудИзба

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 49)

Classes that are linearly separablein the feature space φ(x) need not be linearly separable in the original observationspace x. Note that as in our discussion of linear models for regression, one of the4.3. Probabilistic Discriminative ModelsSection 3.6205basis functions is typically set to a constant, say φ0 (x) = 1, so that the corresponding parameter w0 plays the role of a bias.

For the remainder of this chapter, we shallinclude a ﬁxed basis function transformation φ(x), as this will highlight some usefulsimilarities to the regression models discussed in Chapter 3.For many problems of practical interest, there is signiﬁcant overlap betweenthe class-conditional densities p(x|Ck ). This corresponds to posterior probabilitiesp(Ck |x), which, for at least some values of x, are not 0 or 1. In such cases, the optimal solution is obtained by modelling the posterior probabilities accurately and thenapplying standard decision theory, as discussed in Chapter 1. Note that nonlineartransformations φ(x) cannot remove such class overlap. Indeed, they can increasethe level of overlap, or create overlap where none existed in the original observationspace.

However, suitable choices of nonlinearity can make the process of modellingthe posterior probabilities easier.Such ﬁxed basis function models have important limitations, and these will beresolved in later chapters by allowing the basis functions themselves to adapt to thedata. Notwithstanding these limitations, models with ﬁxed nonlinear basis functionsplay an important role in applications, and a discussion of such models will introduce many of the key concepts needed for an understanding of their more complexcounterparts.4.3.2 Logistic regressionWe begin our treatment of generalized linear models by considering the problemof two-class classiﬁcation.

In our discussion of generative approaches in Section 4.2,we saw that under rather general assumptions, the posterior probability of class C1can be written as a logistic sigmoid acting on a linear function of the feature vectorφ so that(4.87)p(C1 |φ) = y(φ) = σ wT φExercise 4.12with p(C2 |φ) = 1 − p(C1 |φ). Here σ(·) is the logistic sigmoid function deﬁned by(4.59). In the terminology of statistics, this model is known as logistic regression,although it should be emphasized that this is a model for classiﬁcation rather thanregression.For an M -dimensional feature space φ, this model has M adjustable parameters.By contrast, if we had ﬁtted Gaussian class conditional densities using maximumlikelihood, we would have used 2M parameters for the means and M (M + 1)/2parameters for the (shared) covariance matrix. Together with the class prior p(C1 ),this gives a total of M (M +5)/2+1 parameters, which grows quadratically with M ,in contrast to the linear dependence on M of the number of parameters in logisticregression.

For large values of M , there is a clear advantage in working with thelogistic regression model directly.We now use maximum likelihood to determine the parameters of the logisticregression model. To do this, we shall make use of the derivative of the logistic sigmoid function, which can conveniently be expressed in terms of the sigmoid functionitselfdσ= σ(1 − σ).(4.88)da2064.

LINEAR MODELS FOR CLASSIFICATIONFor a data set {φn , tn }, where tn ∈ {0, 1} and φn = φ(xn ), with n =1, . . . , N , the likelihood function can be writtenp(t|w) =N1−tnyntn {1 − yn }(4.89)n=1where t = (t1 , . . . , tN )T and yn = p(C1 |φn ). As usual, we can deﬁne an errorfunction by taking the negative logarithm of the likelihood, which gives the crossentropy error function in the formE(w) = − ln p(t|w) = −N{tn ln yn + (1 − tn ) ln(1 − yn )}(4.90)n=1Exercise 4.13where yn = σ(an ) and an = wT φn . Taking the gradient of the error function withrespect to w, we obtain∇E(w) =N(yn − tn )φn(4.91)n=1Section 3.1.1Exercise 4.14where we have made use of (4.88).

We see that the factor involving the derivativeof the logistic sigmoid has cancelled, leading to a simpliﬁed form for the gradientof the log likelihood. In particular, the contribution to the gradient from data pointn is given by the ‘error’ yn − tn between the target value and the prediction of themodel, times the basis function vector φn . Furthermore, comparison with (3.13)shows that this takes precisely the same form as the gradient of the sum-of-squareserror function for the linear regression model.If desired, we could make use of the result (4.91) to give a sequential algorithmin which patterns are presented one at a time, in which each of the weight vectors isupdated using (3.22) in which ∇En is the nth term in (4.91).It is worth noting that maximum likelihood can exhibit severe over-ﬁtting fordata sets that are linearly separable.

This arises because the maximum likelihood solution occurs when the hyperplane corresponding to σ = 0.5, equivalent to wT φ =0, separates the two classes and the magnitude of w goes to inﬁnity. In this case, thelogistic sigmoid function becomes inﬁnitely steep in feature space, corresponding toa Heaviside step function, so that every training point from each class k is assigneda posterior probability p(Ck |x) = 1. Furthermore, there is typically a continuumof such solutions because any separating hyperplane will give rise to the same posterior probabilities at the training data points, as will be seen later in Figure 10.13.Maximum likelihood provides no way to favour one such solution over another, andwhich solution is found in practice will depend on the choice of optimization algorithm and on the parameter initialization. Note that the problem will arise even ifthe number of data points is large compared with the number of parameters in themodel, so long as the training data set is linearly separable.

The singularity can beavoided by inclusion of a prior and ﬁnding a MAP solution for w, or equivalently byadding a regularization term to the error function.4.3. Probabilistic Discriminative Models2074.3.3 Iterative reweighted least squaresIn the case of the linear regression models discussed in Chapter 3, the maximum likelihood solution, on the assumption of a Gaussian noise model, leads to aclosed-form solution.

This was a consequence of the quadratic dependence of thelog likelihood function on the parameter vector w. For logistic regression, thereis no longer a closed-form solution, due to the nonlinearity of the logistic sigmoidfunction. However, the departure from a quadratic form is not substantial. To beprecise, the error function is concave, as we shall see shortly, and hence has a uniqueminimum. Furthermore, the error function can be minimized by an efﬁcient iterativetechnique based on the Newton-Raphson iterative optimization scheme, which uses alocal quadratic approximation to the log likelihood function.

The Newton-Raphsonupdate, for minimizing a function E(w), takes the form (Fletcher, 1987; Bishop andNabney, 2008)w(new) = w(old) − H−1 ∇E(w).(4.92)where H is the Hessian matrix whose elements comprise the second derivatives ofE(w) with respect to the components of w.Let us ﬁrst of all apply the Newton-Raphson method to the linear regressionmodel (3.3) with the sum-of-squares error function (3.12). The gradient and Hessianof this error function are given by∇E(w) =N(wT φn − tn )φn = ΦT Φw − ΦT t(4.93)Tφn φTn =Φ Φ(4.94)n=1H = ∇∇E(w) =Nn=1Section 3.1.1where Φ is the N × M design matrix, whose nth row is given by φTn . The NewtonRaphson update then takes the formw(new) = w(old) − (ΦT Φ)−1 ΦT Φw(old) − ΦT t= (ΦT Φ)−1 ΦT t(4.95)which we recognize as the standard least-squares solution.

Note that the error function in this case is quadratic and hence the Newton-Raphson formula gives the exactsolution in one step.Now let us apply the Newton-Raphson update to the cross-entropy error function(4.90) for the logistic regression model. From (4.91) we see that the gradient andHessian of this error function are given by∇E(w) =N(yn − tn )φn = ΦT (y − t)(4.96)n=1H = ∇∇E(w) =Nn=1Tyn (1 − yn )φn φTn = Φ RΦ(4.97)2084. LINEAR MODELS FOR CLASSIFICATIONwhere we have made use of (4.88). Also, we have introduced the N × N diagonalmatrix R with elementsRnn = yn (1 − yn ).(4.98)Exercise 4.15We see that the Hessian is no longer constant but depends on w through the weighting matrix R, corresponding to the fact that the error function is no longer quadratic.Using the property 0 < yn < 1, which follows from the form of the logistic sigmoidfunction, we see that uT Hu > 0 for an arbitrary vector u, and so the Hessian matrixH is positive deﬁnite.

It follows that the error function is a concave function of wand hence has a unique minimum.The Newton-Raphson update formula for the logistic regression model then becomesw(new)= w(old) − (ΦT RΦ)−1 ΦT (y − t)= (ΦT RΦ)−1 ΦT RΦw(old) − ΦT (y − t)= (ΦT RΦ)−1 ΦT Rz(4.99)where z is an N -dimensional vector with elementsz = Φw(old) − R−1 (y − t).(4.100)We see that the update formula (4.99) takes the form of a set of normal equations for aweighted least-squares problem.

Because the weighing matrix R is not constant butdepends on the parameter vector w, we must apply the normal equations iteratively,each time using the new weight vector w to compute a revised weighing matrixR. For this reason, the algorithm is known as iterative reweighted least squares, orIRLS (Rubin, 1983). As in the weighted least-squares problem, the elements of thediagonal weighting matrix R can be interpreted as variances because the mean andvariance of t in the logistic regression model are given byE[t] = σ(x) = yvar[t] = E[t2 ] − E[t]2 = σ(x) − σ(x)2 = y(1 − y)(4.101)(4.102)where we have used the property t2 = t for t ∈ {0, 1}.

Характеристики

Тип файла

PDF-файл

Размер

9,37 Mb

Материал

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Тип материала

Книга

Предмет

(ММО) Методы машинного обучения

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

bishop-c.m.-pattern-recognition-and-machine-learning-2006.pdf.rar

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.