In fact, we can interpret IRLS as the solution to a linearized problem in the space of the variable a = w^T φ. The quantity z_n, which corresponds to the nth element of z, can then be given a simple interpretation as an effective target value in this space, obtained by making a local linear approximation to the logistic sigmoid function around the current operating point w^(old):

a_n(w) \simeq a_n(w^{(\text{old})}) + \left.\frac{da_n}{dy_n}\right|_{w^{(\text{old})}} (t_n - y_n) = \phi_n^T w^{(\text{old})} - \frac{y_n - t_n}{y_n(1 - y_n)} = z_n.    (4.103)
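As a concrete illustration (a minimal NumPy sketch, not from the text; names such as irls_step and Phi are illustrative), one IRLS update can be written directly in terms of the effective targets z_n of (4.103), using the diagonal weighting matrix R and the weighted least-squares form of the update introduced earlier in this section:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_step(Phi, t, w_old):
    """One IRLS update for two-class logistic regression (sketch).

    Phi   : (N, M) design matrix whose rows are the basis vectors phi_n
    t     : (N,)  binary targets t_n in {0, 1}
    w_old : (M,)  current weight vector w^(old)
    """
    y = sigmoid(Phi @ w_old)          # y_n = sigma(w^T phi_n)
    r = y * (1.0 - y)                 # diagonal of the weighting matrix R
    z = Phi @ w_old - (y - t) / r     # effective targets z_n, eq. (4.103)
    # weighted least-squares solution of the linearized problem
    A = Phi.T @ (Phi * r[:, None])    # Phi^T R Phi
    b = Phi.T @ (r * z)               # Phi^T R z
    return np.linalg.solve(A, b)      # w^(new)
```

Iterating this update to convergence gives the IRLS solution; the division by y_n(1 − y_n) assumes that no y_n has saturated exactly at 0 or 1.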
4.3.4 Multiclass logistic regression

In our discussion of generative models for multiclass classification (Section 4.2), we have seen that for a large class of distributions the posterior probabilities are given by a softmax transformation of linear functions of the feature variables, so that

p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}    (4.104)

where the 'activations' a_k are given by

a_k = w_k^T \phi.    (4.105)

There we used maximum likelihood to determine separately the class-conditional densities and the class priors and then found the corresponding posterior probabilities using Bayes' theorem, thereby implicitly determining the parameters {w_k}.
Here we consider the use of maximum likelihood to determine the parameters {w_k} of this model directly. To do this, we will require the derivatives of y_k with respect to all of the activations a_j. These are given (Exercise 4.17) by

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)    (4.106)

where I_{kj} are the elements of the identity matrix.
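The softmax and its derivatives (4.106) are easy to check numerically. The following sketch (NumPy; names are illustrative) compares the analytic Jacobian y_k(I_kj − y_j) with a central finite-difference estimate:

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                 # subtract the maximum for numerical stability
    e = np.exp(a)
    return e / e.sum()

a = np.array([0.5, -1.2, 2.0])        # activations a_k = w_k^T phi
y = softmax(a)

# analytic Jacobian, eq. (4.106): dy_k/da_j = y_k (I_kj - y_j)
J_analytic = np.diag(y) - np.outer(y, y)

# central finite-difference estimate of the same Jacobian
eps = 1e-6
J_numeric = np.empty((3, 3))
for j in range(3):
    da = np.zeros(3); da[j] = eps
    J_numeric[:, j] = (softmax(a + da) - softmax(a - da)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-8))   # True
```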
Next we write down the likelihood function. This is most easily done using the 1-of-K coding scheme, in which the target vector t_n for a feature vector φ_n belonging to class C_k is a binary vector with all elements zero except for element k, which equals one. The likelihood function is then given by

p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}    (4.107)

where y_{nk} = y_k(φ_n), and T is an N × K matrix of target variables with elements t_{nk}. Taking the negative logarithm then gives

E(w_1, \ldots, w_K) = -\ln p(T|w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}    (4.108)

which is known as the cross-entropy error function for the multiclass classification problem.
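A direct transcription of (4.107) and (4.108) might look as follows (a sketch with illustrative names; a small constant guards the logarithm against y_nk underflowing to zero):

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)   # stabilise each row
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, Phi, T, eps=1e-12):
    """Multiclass cross-entropy error, eq. (4.108).

    W   : (M, K) weight vectors w_k stored as columns
    Phi : (N, M) design matrix of basis-function values
    T   : (N, K) 1-of-K target matrix with elements t_nk
    """
    Y = softmax_rows(Phi @ W)              # y_nk = y_k(phi_n)
    return -np.sum(T * np.log(Y + eps))
```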
We now take the gradient of the error function with respect to one of the parameter vectors w_j. Making use of the result (4.106) for the derivatives of the softmax function, we obtain (Exercise 4.18)

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \phi_n    (4.109)

where we have made use of \sum_k t_{nk} = 1. Once again, we see the same form arising for the gradient as was found for the sum-of-squares error function with the linear model and the cross-entropy error for the logistic regression model, namely the product of the error (y_{nj} − t_{nj}) times the basis function φ_n. Again, we could use this to formulate a sequential algorithm in which patterns are presented one at a time, and each of the weight vectors is updated using (3.22).
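In code, the gradient (4.109) and the corresponding sequential update for a single pattern could be sketched as follows (illustrative names; a fixed learning rate eta is assumed):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sgd_update(W, phi_n, t_n, eta=0.1):
    """Sequential update of all weight vectors for one pattern.

    W     : (M, K) weight vectors w_k stored as columns
    phi_n : (M,)  basis-function vector for pattern n
    t_n   : (K,)  1-of-K target vector
    """
    y_n = softmax(W.T @ phi_n)             # posterior probabilities y_nk
    grad = np.outer(phi_n, y_n - t_n)      # column j is (y_nj - t_nj) phi_n, eq. (4.109)
    return W - eta * grad                  # gradient-descent step, as in (3.22)
```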
We have seen that the derivative of the log likelihood function for a linear regression model with respect to the parameter vector w for a data point n took the form of the 'error' y_n − t_n times the feature vector φ_n. Similarly, for the combination of logistic sigmoid activation function and cross-entropy error function (4.90), and for the softmax activation function with the multiclass cross-entropy error function (4.108), we again obtain this same simple form. This is an example of a more general result, as we shall see in Section 4.3.6.

To find a batch algorithm, we again appeal to the Newton-Raphson update to obtain the corresponding IRLS algorithm for the multiclass problem. This requires evaluation of the Hessian matrix, which comprises blocks of size M × M in which block j, k is given by
\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \phi_n \phi_n^T.    (4.110)

As with the two-class problem, the Hessian matrix for the multiclass logistic regression model is positive definite (Exercise 4.20), and so the error function again has a unique minimum. Practical details of IRLS for the multiclass case can be found in Bishop and Nabney (2008).
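A sketch of one such Newton-Raphson step, assembling the full block Hessian explicitly (only practical for small M and K; names are illustrative, and a tiny ridge is added because the exact Hessian is singular along the direction that adds the same vector to every w_k):

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def newton_step(W, Phi, T):
    """One Newton-Raphson (IRLS) step for multiclass logistic regression (sketch).

    W : (M, K) weight vectors w_k as columns; Phi : (N, M); T : (N, K) 1-of-K targets.
    """
    N, M = Phi.shape
    K = W.shape[1]
    Y = softmax_rows(Phi @ W)                      # y_nk = y_k(phi_n)
    grad = (Phi.T @ (Y - T)).T.reshape(K * M)      # stacked gradients, eq. (4.109)
    H = np.zeros((K * M, K * M))
    for j in range(K):
        for k in range(K):
            # block (j, k) of the Hessian, eq. (4.110): sum_n y_nk (I_kj - y_nj) phi_n phi_n^T
            c = Y[:, k] * ((j == k) - Y[:, j])
            H[j*M:(j+1)*M, k*M:(k+1)*M] = Phi.T @ (Phi * c[:, None])
    # small ridge guards against the exact singularity mentioned above
    step = np.linalg.solve(H + 1e-8 * np.eye(K * M), grad)
    w_stacked = W.T.reshape(K * M) - step
    return w_stacked.reshape(K, M).T               # back to (M, K)
```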
4.3.5 Probit regression

We have seen that, for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities (for instance, if the class-conditional densities are modelled using Gaussian mixtures). This suggests that it might be worth exploring other types of discriminative probabilistic model. For the purposes of this chapter, however, we shall return to the two-class case, and again remain within the framework of generalized linear models so that

p(t = 1|a) = f(a)    (4.111)

where a = w^T φ, and f(·) is the activation function.

One way to motivate an alternative choice for the link function is to consider a noisy threshold model, as follows. For each input φ_n, we evaluate a_n = w^T φ_n and then set the target value according to

t_n = 1 if a_n \geq \theta, and t_n = 0 otherwise.    (4.112)
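The following sketch (illustrative, assuming SciPy; the mixture parameters are made up) simulates this noisy threshold model with θ drawn from a mixture of two Gaussians, as in Figure 4.13, and checks that the empirical P(t = 1|a) matches the cumulative distribution function of p(θ):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sample_theta(n):
    """Draw thresholds from an (assumed) mixture of two Gaussians."""
    comp = rng.random(n) < 0.6
    return np.where(comp, rng.normal(1.0, 0.4, n), rng.normal(2.5, 0.3, n))

def mixture_cdf(a):
    return 0.6 * norm.cdf(a, 1.0, 0.4) + 0.4 * norm.cdf(a, 2.5, 0.3)

a = 1.8                                 # a fixed activation a = w^T phi
theta = sample_theta(200000)
t = (a >= theta).astype(float)          # noisy threshold rule, eq. (4.112)

print(t.mean(), mixture_cdf(a))         # empirical P(t=1|a) vs. the CDF of p(theta)
```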
Figure 4.13  Schematic example of a probability density p(θ), shown by the blue curve, given in this example by a mixture of two Gaussians, along with its cumulative distribution function f(a), shown by the red curve. Note that the value of the blue curve at any point, such as that indicated by the vertical green line, corresponds to the slope of the red curve at the same point. Conversely, the value of the red curve at this point corresponds to the area under the blue curve indicated by the shaded green region. In the stochastic threshold model, the class label takes the value t = 1 if the value of a = w^T φ exceeds a threshold, otherwise it takes the value t = 0. This is equivalent to an activation function given by the cumulative distribution function f(a).
If the value of θ is drawn from a probability density p(θ), then the corresponding activation function will be given by the cumulative distribution function

f(a) = \int_{-\infty}^{a} p(\theta)\, d\theta    (4.113)

as illustrated in Figure 4.13.

As a specific example, suppose that the density p(θ) is given by a zero-mean, unit-variance Gaussian. The corresponding cumulative distribution function is given by

\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta|0, 1)\, d\theta    (4.114)

which is known as the probit function. It has a sigmoidal shape and is compared with the logistic sigmoid function in Figure 4.9. Note that the use of a more general Gaussian distribution does not change the model, because this is equivalent to a re-scaling of the linear coefficients w.
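The probit function (4.114) is simply the standard normal cumulative distribution function, which most numerical libraries provide directly; for example (a sketch assuming SciPy):

```python
import numpy as np
from scipy.stats import norm

def probit(a):
    """Probit activation, eq. (4.114): CDF of a zero-mean, unit-variance Gaussian."""
    return norm.cdf(a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-3, 3, 7)
print(probit(a))     # sigmoidal, saturating at 0 and 1
print(sigmoid(a))    # logistic sigmoid for comparison (cf. Figure 4.9)
```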
Many numerical packages provide for the evaluation of a closely related function, defined by

\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} \exp(-\theta^2/2)\, d\theta    (4.115)

and known as the erf function or error function (not to be confused with the error function of a machine learning model).
It is related to the probit function (Exercise 4.21) by

\Phi(a) = \frac{1}{2} \left\{ 1 + \frac{1}{\sqrt{2}}\, \mathrm{erf}(a) \right\}.    (4.116)

The generalized linear model based on a probit activation function is known as probit regression.
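Some care is needed when checking (4.116) numerically, because the definition (4.115) uses exp(−θ²/2), whereas the erf implemented in most libraries (for example scipy.special.erf) integrates exp(−θ²). A small sketch relating the two conventions:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import erf        # library convention: (2/sqrt(pi)) int_0^a exp(-u^2) du

def erf_text(a):
    """The erf function as defined in eq. (4.115), with exp(-theta^2 / 2)."""
    return np.sqrt(2.0) * erf(a / np.sqrt(2.0))

a = np.linspace(-3, 3, 13)
lhs = norm.cdf(a)                                  # probit function Phi(a), eq. (4.114)
rhs = 0.5 * (1.0 + erf_text(a) / np.sqrt(2.0))     # right-hand side of eq. (4.116)
print(np.allclose(lhs, rhs))                       # True
```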
We can determine the parameters of this model using maximum likelihood, by a straightforward extension of the ideas discussed earlier. In practice, the results found using probit regression tend to be similar to those of logistic regression. We shall, however, find another use for the probit model when we discuss Bayesian treatments of logistic regression in Section 4.5.

One issue that can occur in practical applications is that of outliers, which can arise for instance through errors in measuring the input vector x or through mislabelling of the target value t. Because such points can lie a long way to the wrong side of the ideal decision boundary, they can seriously distort the classifier. Note that the logistic and probit regression models behave differently in this respect, because the tails of the logistic sigmoid decay asymptotically like exp(−x) for x → ∞, whereas for the probit activation function they decay like exp(−x²), and so the probit model can be significantly more sensitive to outliers.
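The difference in tail behaviour is easy to see numerically: for a point far on the wrong side of the boundary, the probit model assigns a far smaller probability and hence a much larger negative log-likelihood penalty (a quick check, assuming SciPy):

```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for a in [2.0, 4.0, 6.0]:
    p_logistic = sigmoid(-a)        # logistic tail, decays roughly like exp(-a)
    p_probit = norm.cdf(-a)         # probit tail, decays like a Gaussian (much faster)
    print(a, p_logistic, p_probit, -np.log(p_logistic), -np.log(p_probit))
```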
However, both the logistic and the probit models assume the data is correctly labelled. The effect of mislabelling is easily incorporated into a probabilistic model by introducing a probability ε that the target value t has been flipped to the wrong value (Opper and Winther, 2000a), leading to a target value distribution for data point x of the form

p(t|x) = (1 - \epsilon)\,\sigma(x) + \epsilon\,(1 - \sigma(x)) = \epsilon + (1 - 2\epsilon)\,\sigma(x)    (4.117)

where σ(x) is the activation function with input vector x. Here ε may be set in advance, or it may be treated as a hyperparameter whose value is inferred from the data.
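A sketch of the corresponding robust likelihood (illustrative names; here σ is taken to be the logistic sigmoid applied to a = w^T φ, and ε is fixed in advance):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_target(t, a, eps=0.05):
    """Label-noise model of eq. (4.117): p(t=1) = eps + (1 - 2*eps) * sigma(a)."""
    p1 = eps + (1.0 - 2.0 * eps) * sigmoid(a)
    return np.where(t == 1, p1, 1.0 - p1)

def neg_log_lik(w, Phi, t, eps=0.05):
    """Negative log likelihood under the mislabelling model (sketch)."""
    a = Phi @ w
    return -np.sum(np.log(p_target(t, a, eps)))
```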
4.3.6 Canonical link functions

For the linear regression model with a Gaussian noise distribution, the error function, corresponding to the negative log likelihood, is given by (3.12). If we take the derivative with respect to the parameter vector w of the contribution to the error function from a data point n, this takes the form of the 'error' y_n − t_n times the feature vector φ_n, where y_n = w^T φ_n. Similarly, for the combination of the logistic sigmoid activation function and the cross-entropy error function (4.90), and for the softmax activation function with the multiclass cross-entropy error function (4.108), we again obtain this same simple form.
We now show that this is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function.

We again make use of the restricted form (4.84) of exponential family distributions. Note that here we are applying the assumption of exponential family distribution to the target variable t, in contrast to Section 4.2.4 where we applied it to the input vector x. We therefore consider conditional distributions of the target variable of the form

p(t|\eta, s) = \frac{1}{s}\, h\!\left(\frac{t}{s}\right) g(\eta) \exp\!\left\{\frac{\eta t}{s}\right\}.    (4.118)

Using the same line of argument as led to the derivation of the result (2.226), we see that the conditional mean of t, which we denote by y, is given by

y \equiv \mathbb{E}[t|\eta] = -s \frac{d}{d\eta} \ln g(\eta).    (4.119)
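As an illustrative check of the form (4.118) (a worked example, not part of the text), the Bernoulli distribution used for two-class targets fits this form with s = 1. Writing

p(t|\mu) = \mu^{t} (1-\mu)^{1-t} = (1-\mu) \exp\!\left\{ t \ln\frac{\mu}{1-\mu} \right\},

we can identify h(t) = 1, \eta = \ln\{\mu/(1-\mu)\} and g(\eta) = 1-\mu = 1/(1+e^{\eta}). Applying (4.119) with s = 1 then gives

y = -\frac{d}{d\eta} \ln g(\eta) = \frac{d}{d\eta} \ln(1 + e^{\eta}) = \frac{e^{\eta}}{1 + e^{\eta}} = \sigma(\eta) = \mu,

so the conditional mean is recovered correctly, and the activation function associated with this canonical choice is the logistic sigmoid, consistent with the earlier results for logistic regression.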