Bishop C.M., Pattern Recognition and Machine Learning (2006), page 76
We can use this freedom to set
$$t_n\bigl(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) + b\bigr) = 1 \tag{7.4}$$
for the point that is closest to the surface. In this case, all data points will satisfy the constraints
$$t_n\bigl(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) + b\bigr) \geqslant 1, \qquad n = 1, \ldots, N. \tag{7.5}$$
This is known as the canonical representation of the decision hyperplane. In the case of data points for which the equality holds, the constraints are said to be active, whereas for the remainder they are said to be inactive.
By definition, there will always be at least one active constraint, because there will always be a closest point, and once the margin has been maximized there will be at least two active constraints. The optimization problem then simply requires that we maximize $\|\mathbf{w}\|^{-1}$, which is equivalent to minimizing $\|\mathbf{w}\|^2$, and so we have to solve the optimization problem
$$\arg\min_{\mathbf{w},\, b}\; \frac{1}{2}\|\mathbf{w}\|^2 \tag{7.6}$$
subject to the constraints given by (7.5). The factor of 1/2 in (7.6) is included for later convenience.
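As an aside (not part of the original text), for a small, linearly separable data set the primal problem (7.6) can be handed directly to a general-purpose constrained optimizer. The sketch below uses SciPy's SLSQP method; the function and variable names (`fit_primal_svm`, `Phi`, `t`) are illustrative assumptions rather than anything prescribed by the book.

```python
# Minimal sketch: solve (7.6) subject to the canonical constraints (7.5)
# with a general-purpose constrained optimizer (small problems only).
import numpy as np
from scipy.optimize import minimize

def fit_primal_svm(Phi, t):
    """Minimize (1/2)||w||^2 subject to t_n (w^T phi_n + b) >= 1."""
    N, M = Phi.shape

    def objective(params):
        w = params[:M]
        return 0.5 * np.dot(w, w)

    def constraints_fun(params):
        w, b = params[:M], params[M]
        # Each component must be >= 0: the canonical constraints (7.5).
        return t * (Phi @ w + b) - 1.0

    res = minimize(objective,
                   x0=np.zeros(M + 1),
                   method="SLSQP",
                   constraints=[{"type": "ineq", "fun": constraints_fun}])
    return res.x[:M], res.x[M]

# Toy usage: two well-separated clusters in 2-D, identity feature map.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (10, 2)), rng.normal(+2, 0.5, (10, 2))])
t = np.array([-1] * 10 + [+1] * 10)
w, b = fit_primal_svm(X, t)
print(w, b)
```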
This is an example of a quadratic programming problem in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints. It appears that the bias parameter $b$ has disappeared from the optimization. However, it is determined implicitly via the constraints, because these require that changes to $\mathbf{w}$ be compensated by changes to $b$. We shall see how this works shortly.

In order to solve this constrained optimization problem, we introduce Lagrange multipliers $a_n \geqslant 0$ (Appendix E), with one multiplier $a_n$ for each of the constraints in (7.5), giving the Lagrangian function
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \bigl\{ t_n\bigl(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) + b\bigr) - 1 \bigr\} \tag{7.7}$$
where $\mathbf{a} = (a_1, \ldots, a_N)^{\mathrm{T}}$. Note the minus sign in front of the Lagrange multiplier term, because we are minimizing with respect to $\mathbf{w}$ and $b$, and maximizing with respect to $\mathbf{a}$. Setting the derivatives of $L(\mathbf{w}, b, \mathbf{a})$ with respect to $\mathbf{w}$ and $b$ equal to zero, we obtain the following two conditions
$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \boldsymbol{\phi}(\mathbf{x}_n) \tag{7.8}$$
$$0 = \sum_{n=1}^{N} a_n t_n. \tag{7.9}$$
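For readers who wish to see the intermediate algebra (this worked step is an addition, not part of Bishop's text), substituting (7.8) into (7.7) and using (7.9) to cancel the term involving $b$ gives
$$
\begin{aligned}
L &= \tfrac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}
   - \sum_{n} a_n t_n \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)
   - b \sum_{n} a_n t_n + \sum_{n} a_n \\
  &= \tfrac{1}{2}\sum_{n,m} a_n a_m t_n t_m \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_m)
   - \sum_{n,m} a_n a_m t_n t_m \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_m)
   + \sum_{n} a_n \\
  &= \sum_{n} a_n - \tfrac{1}{2}\sum_{n,m} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m),
\end{aligned}
$$
which is exactly the dual objective (7.10) stated below.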
Eliminating $\mathbf{w}$ and $b$ from $L(\mathbf{w}, b, \mathbf{a})$ using these conditions then gives the dual representation of the maximum margin problem in which we maximize
$$\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \tag{7.10}$$
with respect to $\mathbf{a}$ subject to the constraints
$$a_n \geqslant 0, \qquad n = 1, \ldots, N, \tag{7.11}$$
$$\sum_{n=1}^{N} a_n t_n = 0. \tag{7.12}$$
Here the kernel function is defined by $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}')$. Again, this takes the form of a quadratic programming problem in which we optimize a quadratic function of $\mathbf{a}$ subject to a set of inequality constraints. We shall discuss techniques for solving such quadratic programming problems in Section 7.1.1.

The solution to a quadratic programming problem in $M$ variables in general has computational complexity that is $O(M^3)$.
In going to the dual formulation we have turned the original optimization problem, which involved minimizing (7.6) over $M$ variables, into the dual problem (7.10), which has $N$ variables. For a fixed set of basis functions whose number $M$ is smaller than the number $N$ of data points, the move to the dual problem appears disadvantageous. However, it allows the model to be reformulated using kernels, and so the maximum margin classifier can be applied efficiently to feature spaces whose dimensionality exceeds the number of data points, including infinite feature spaces. The kernel formulation also makes clear the role of the constraint that the kernel function $k(\mathbf{x}, \mathbf{x}')$ be positive definite, because this ensures that the Lagrangian function $\widetilde{L}(\mathbf{a})$ is bounded below, giving rise to a well-defined optimization problem.
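To make the dual formulation concrete, here is a minimal sketch (an addition, not the book's algorithm) that solves (7.10)-(7.12) numerically with a general-purpose solver and a Gaussian kernel. In practice a dedicated QP or SMO routine would be used, and the helper names (`gaussian_kernel`, `fit_dual_svm`) are illustrative assumptions.

```python
# Minimal sketch: maximize the dual (7.10) subject to (7.11) and (7.12)
# by minimizing its negative with a general-purpose constrained optimizer.
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), as in (6.23)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_dual_svm(X, t, kernel=gaussian_kernel):
    N = len(t)
    K = kernel(X, X)
    Q = (t[:, None] * t[None, :]) * K          # Q_nm = t_n t_m k(x_n, x_m)

    def neg_dual(a):                           # negative of (7.10)
        return 0.5 * a @ Q @ a - a.sum()

    constraints = [{"type": "eq", "fun": lambda a: a @ t}]   # (7.12)
    bounds = [(0.0, None)] * N                               # (7.11)
    res = minimize(neg_dual, np.zeros(N), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x                                             # the multipliers a_n
```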
In order to classify new data points using the trained model, we evaluate the sign of $y(\mathbf{x})$ defined by (7.1). This can be expressed in terms of the parameters $\{a_n\}$ and the kernel function by substituting for $\mathbf{w}$ using (7.8) to give
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b. \tag{7.13}$$

[Biographical note] Joseph-Louis Lagrange (1736-1813). Although widely considered to be a French mathematician, Lagrange was born in Turin in Italy. By the age of nineteen, he had already made important contributions to mathematics and had been appointed as Professor at the Royal Artillery School in Turin. For many years, Euler worked hard to persuade Lagrange to move to Berlin, which he eventually did in 1766, where he succeeded Euler as Director of Mathematics at the Berlin Academy. Later he moved to Paris, narrowly escaping with his life during the French revolution thanks to the personal intervention of Lavoisier (the French chemist who discovered oxygen), who himself was later executed at the guillotine. Lagrange made key contributions to the calculus of variations and the foundations of dynamics.
In Appendix E, we show that a constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which in this case require that the following three properties hold
$$a_n \geqslant 0 \tag{7.14}$$
$$t_n y(\mathbf{x}_n) - 1 \geqslant 0 \tag{7.15}$$
$$a_n \bigl\{ t_n y(\mathbf{x}_n) - 1 \bigr\} = 0. \tag{7.16}$$
Thus for every data point, either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1$. Any data point for which $a_n = 0$ will not appear in the sum in (7.13) and hence plays no role in making predictions for new data points. The remaining data points are called support vectors, and because they satisfy $t_n y(\mathbf{x}_n) = 1$, they correspond to points that lie on the maximum margin hyperplanes in feature space, as illustrated in Figure 7.1. This property is central to the practical applicability of support vector machines. Once the model is trained, a significant proportion of the data points can be discarded and only the support vectors retained.
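In code, the practical consequence of the complementarity condition (7.16) is that the support vectors can be read off from the multipliers. A brief illustrative helper is sketched below (an addition; in floating point the multipliers are never exactly zero, so the tolerance is an assumption rather than anything prescribed by the text).

```python
# Minimal sketch: pick out the support vectors from a solved dual problem.
import numpy as np

def support_vector_indices(a, tol=1e-6):
    """Indices n with a_n > 0; all other points drop out of the sum (7.13)."""
    return np.flatnonzero(a > tol)
```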
Having solved the quadratic programming problem and found a value for $\mathbf{a}$, we can then determine the value of the threshold parameter $b$ by noting that any support vector $\mathbf{x}_n$ satisfies $t_n y(\mathbf{x}_n) = 1$. Using (7.13) this gives
$$t_n \left( \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) + b \right) = 1 \tag{7.17}$$
where $\mathcal{S}$ denotes the set of indices of the support vectors.
Although we can solve this equation for $b$ using an arbitrarily chosen support vector $\mathbf{x}_n$, a numerically more stable solution is obtained by first multiplying through by $t_n$, making use of $t_n^2 = 1$, and then averaging these equations over all support vectors and solving for $b$ to give
$$b = \frac{1}{N_{\mathcal{S}}} \sum_{n \in \mathcal{S}} \left( t_n - \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) \right) \tag{7.18}$$
where $N_{\mathcal{S}}$ is the total number of support vectors.

For later comparison with alternative models, we can express the maximum-margin classifier in terms of the minimization of an error function, with a simple quadratic regularizer, in the form
$$\sum_{n=1}^{N} E_{\infty}\bigl(y(\mathbf{x}_n) t_n - 1\bigr) + \lambda \|\mathbf{w}\|^2 \tag{7.19}$$
where $E_{\infty}(z)$ is a function that is zero if $z \geqslant 0$ and $\infty$ otherwise and ensures that the constraints (7.5) are satisfied. Note that as long as the regularization parameter satisfies $\lambda > 0$, its precise value plays no role.
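Equations (7.18) and (7.13) translate directly into code. The following sketch (an addition, building on the hypothetical helpers introduced above) computes the bias by averaging over the support vectors and then evaluates $y(\mathbf{x})$ for new points.

```python
# Minimal sketch: bias via (7.18) and prediction via (7.13),
# reusing the hypothetical helpers gaussian_kernel / support_vector_indices.
import numpy as np

def compute_bias(a, t, K, sv):
    """b = (1/N_S) sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )."""
    return np.mean(t[sv] - K[np.ix_(sv, sv)] @ (a[sv] * t[sv]))

def predict(X_new, X_sv, a_sv, t_sv, b, kernel):
    """y(x) = sum_{n in S} a_n t_n k(x, x_n) + b; classify by its sign."""
    K_new = kernel(X_new, X_sv)                # shape (num_new, num_sv)
    return K_new @ (a_sv * t_sv) + b
```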
[Figure 7.2: Example of synthetic data from two classes in two dimensions showing contours of constant $y(\mathbf{x})$ obtained from a support vector machine having a Gaussian kernel function. Also shown are the decision boundary, the margin boundaries, and the support vectors.]

Figure 7.2 shows an example of the classification resulting from training a support vector machine on a simple synthetic data set using a Gaussian kernel of the form (6.23). Although the data set is not linearly separable in the two-dimensional data space $\mathbf{x}$, it is linearly separable in the nonlinear feature space defined implicitly by the nonlinear kernel function. Thus the training data points are perfectly separated in the original data space.
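As a rough analogue of the experiment behind Figure 7.2 (this example is an addition, not the book's data set), the sketched dual solver can be trained on an XOR-style synthetic problem with a Gaussian kernel; under these assumptions one would expect the training points to be perfectly separated, with only a subset of them ending up as support vectors.

```python
# Minimal end-to-end sketch using the hypothetical helpers defined above.
import numpy as np

rng = np.random.default_rng(1)

# XOR-style data: four Gaussian blobs; diagonally opposite blobs share a class.
centres = np.array([[-2, -2], [2, 2], [-2, 2], [2, -2]], dtype=float)
X = np.vstack([c + 0.4 * rng.standard_normal((15, 2)) for c in centres])
t = np.array([+1] * 30 + [-1] * 30)

a = fit_dual_svm(X, t)                 # solve the dual (7.10)-(7.12)
sv = support_vector_indices(a)         # points with a_n > 0
K = gaussian_kernel(X, X)
b = compute_bias(a, t, K, sv)          # bias from (7.18)

y = predict(X, X[sv], a[sv], t[sv], b, gaussian_kernel)
print("training accuracy:", np.mean(np.sign(y) == t))
print("number of support vectors:", len(sv))
```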
This example also provides a geometrical insight into the origin of sparsity in the SVM. The maximum margin hyperplane is defined by the location of the support vectors. Other data points can be moved around freely (so long as they remain outside the margin region) without changing the decision boundary, and so the solution will be independent of such data points.

7.1.1 Overlapping class distributions

So far, we have assumed that the training data points are linearly separable in the feature space $\boldsymbol{\phi}(\mathbf{x})$.
The resulting support vector machine will give exact separation of the training data in the original input space $\mathbf{x}$, although the corresponding decision boundary will be nonlinear. In practice, however, the class-conditional distributions may overlap, in which case exact separation of the training data can lead to poor generalization.

We therefore need a way to modify the support vector machine so as to allow some of the training points to be misclassified. From (7.19) we see that in the case of separable classes, we implicitly used an error function that gave infinite error if a data point was misclassified and zero error if it was classified correctly, and then optimized the model parameters to maximize the margin.