We now modify this approach so that data points are allowed to be on the 'wrong side' of the margin boundary, but with a penalty that increases with the distance from that boundary. For the subsequent optimization problem, it is convenient to make this penalty a linear function of this distance. To do this, we introduce slack variables, $\xi_n \geq 0$ where $n = 1, \ldots, N$, with one slack variable for each training data point (Bennett, 1992; Cortes and Vapnik, 1995). These are defined by $\xi_n = 0$ for data points that are on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ for other points.
Thus a data point that is on the decision boundary $y(\mathbf{x}_n) = 0$ will have $\xi_n = 1$, and points with $\xi_n > 1$ will be misclassified. The exact classification constraints (7.5) are then replaced with
$$
t_n y(\mathbf{x}_n) \geq 1 - \xi_n, \qquad n = 1, \ldots, N \tag{7.20}
$$
in which the slack variables are constrained to satisfy $\xi_n \geq 0$. Data points for which $\xi_n = 0$ are correctly classified and are either on the margin or on the correct side of the margin. Points for which $0 < \xi_n \leq 1$ lie inside the margin, but on the correct side of the decision boundary, and those data points for which $\xi_n > 1$ lie on the wrong side of the decision boundary and are misclassified, as illustrated in Figure 7.3. This is sometimes described as relaxing the hard margin constraint to give a soft margin and allows some of the training set data points to be misclassified.

Figure 7.3  Illustration of the slack variables $\xi_n \geq 0$. Data points with circles around them are support vectors. (The figure shows the contours $y = -1$, $y = 0$, and $y = 1$, together with points having $\xi = 0$, $\xi < 1$, and $\xi > 1$.)
Note that while slack variables allow for overlapping class distributions, this framework is still sensitive to outliers because the penalty for misclassification increases linearly with $\xi$. Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize
$$
C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|\mathbf{w}\|^2 \tag{7.21}
$$
where the parameter $C > 0$ controls the trade-off between the slack variable penalty and the margin.
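The objective (7.21) is straightforward to evaluate numerically once we note that, for targets $t_n \in \{-1, +1\}$, the slack variable definitions above imply $\xi_n = \max(0, 1 - t_n y(\mathbf{x}_n))$. The following is a minimal sketch in Python/NumPy; the toy data, the linear model $y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\mathbf{x} + b$, and the choice $C = 1$ are purely illustrative.

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C=1.0):
    """Evaluate C * sum_n xi_n + 0.5 * ||w||^2, eq. (7.21), for a linear model y(x) = w.x + b.

    The slack xi_n is zero for points on or inside the correct margin
    boundary and |t_n - y(x_n)| otherwise, i.e. max(0, 1 - t_n y(x_n)).
    """
    y = X @ w + b                      # y(x_n) for every training point
    xi = np.maximum(0.0, 1.0 - t * y)  # slack variables xi_n >= 0
    return C * xi.sum() + 0.5 * w @ w

# Hypothetical toy data with targets t_n in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.5])
b = 0.0
print(soft_margin_objective(w, b, X, t, C=1.0))
```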
Because any point that is misclassified has $\xi_n > 1$, it follows that $\sum_n \xi_n$ is an upper bound on the number of misclassified points. The parameter $C$ is therefore analogous to (the inverse of) a regularization coefficient because it controls the trade-off between minimizing training errors and controlling model complexity. In the limit $C \to \infty$, we will recover the earlier support vector machine for separable data.

We now wish to minimize (7.21) subject to the constraints (7.20) together with $\xi_n \geq 0$. The corresponding Lagrangian is given by
$$
L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n \tag{7.22}
$$
where $\{a_n \geq 0\}$ and $\{\mu_n \geq 0\}$ are Lagrange multipliers (Appendix E). The corresponding set of KKT conditions is given by
$$
\begin{aligned}
a_n &\geq 0 &\quad& (7.23)\\
t_n y(\mathbf{x}_n) - 1 + \xi_n &\geq 0 && (7.24)\\
a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} &= 0 && (7.25)\\
\mu_n &\geq 0 && (7.26)\\
\xi_n &\geq 0 && (7.27)\\
\mu_n \xi_n &= 0 && (7.28)
\end{aligned}
$$
where $n = 1, \ldots, N$.

We now optimize out $\mathbf{w}$, $b$, and $\{\xi_n\}$ making use of the definition (7.1) of $y(\mathbf{x})$ to give
$$
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n) \tag{7.29}
$$
$$
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0 \tag{7.30}
$$
$$
\frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n. \tag{7.31}
$$
Using these results to eliminate $\mathbf{w}$, $b$, and $\{\xi_n\}$ from the Lagrangian, we obtain the dual Lagrangian in the form
$$
\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \tag{7.32}
$$
which is identical to the separable case, except that the constraints are somewhat different.
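The dual Lagrangian (7.32) depends on the data only through the kernel (Gram) matrix, and is easy to evaluate for any candidate multiplier vector. The following is a minimal sketch, assuming a precomputed kernel matrix `K` and multipliers `a` supplied by the caller.

```python
import numpy as np

def dual_objective(a, t, K):
    """Dual Lagrangian (7.32): sum_n a_n - 0.5 * sum_{n,m} a_n a_m t_n t_m k(x_n, x_m).

    a : (N,) Lagrange multipliers
    t : (N,) targets in {-1, +1}
    K : (N, N) kernel matrix with K[n, m] = k(x_n, x_m)
    """
    at = a * t                     # elementwise a_n t_n
    return a.sum() - 0.5 * at @ K @ at
```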
To see what these constraints are, we note that $a_n \geq 0$ is required because these are Lagrange multipliers. Furthermore, (7.31) together with $\mu_n \geq 0$ implies $a_n \leq C$. We therefore have to maximize (7.32) with respect to the dual variables $\{a_n\}$ subject to
$$
0 \leq a_n \leq C \tag{7.33}
$$
$$
\sum_{n=1}^{N} a_n t_n = 0 \tag{7.34}
$$
for $n = 1, \ldots, N$, where (7.33) are known as box constraints.
This again represents a quadratic programming problem. If we substitute (7.29) into (7.1), we see that predictions for new data points are again made by using (7.13).

We can now interpret the resulting solution. As before, a subset of the data points may have $a_n = 0$, in which case they do not contribute to the predictive model (7.13). The remaining data points constitute the support vectors. These have $a_n > 0$ and hence from (7.25) must satisfy
$$
t_n y(\mathbf{x}_n) = 1 - \xi_n. \tag{7.35}
$$
If $a_n < C$, then (7.31) implies that $\mu_n > 0$, which from (7.28) requires $\xi_n = 0$ and hence such points lie on the margin. Points with $a_n = C$ can lie inside the margin and can either be correctly classified if $\xi_n \leq 1$ or misclassified if $\xi_n > 1$.

To determine the parameter $b$ in (7.1), we note that those support vectors for which $0 < a_n < C$ have $\xi_n = 0$ so that $t_n y(\mathbf{x}_n) = 1$ and hence will satisfy
$$
t_n \left( \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) + b \right) = 1. \tag{7.36}
$$
Again, a numerically stable solution is obtained by averaging to give
$$
b = \frac{1}{N_{\mathcal{M}}} \sum_{n \in \mathcal{M}} \left( t_n - \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) \right) \tag{7.37}
$$
where $\mathcal{M}$ denotes the set of indices of data points having $0 < a_n < C$.
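One simple way to obtain the multipliers in practice is to hand the dual, with its box constraints, to a generic quadratic programming solver and then recover $b$ from (7.37). The sketch below is illustrative only: it assumes the `cvxopt` package for the QP step (any QP solver would do), and the tolerance `tol` used to decide which $a_n$ are nonzero is a hypothetical practical heuristic.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed available; any QP solver would do

def fit_soft_margin_svm(K, t, C=1.0, tol=1e-6):
    """Maximize the dual (7.32) subject to (7.33) and (7.34), then compute b via (7.37).

    K : (N, N) kernel matrix, t : (N,) targets in {-1, +1}.
    """
    t = np.asarray(t, dtype=float)
    N = len(t)
    P = matrix(np.outer(t, t) * K)                  # quadratic term of -(7.32)
    q = matrix(-np.ones(N))                         # linear term: -sum_n a_n
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))  # encode 0 <= a_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(t.reshape(1, -1))                    # equality constraint sum_n a_n t_n = 0
    sol = solvers.qp(P, q, G, h, A, matrix(0.0))
    a = np.asarray(sol['x']).ravel()

    S = a > tol                     # support vectors: a_n > 0
    M = (a > tol) & (a < C - tol)   # margin support vectors: 0 < a_n < C
    # Bias from (7.37), averaging over the set M for numerical stability.
    b = np.mean(t[M] - (a[S] * t[S]) @ K[np.ix_(S, M)])
    return a, b

def predict(K_new, a, t, b, tol=1e-6):
    """Evaluate y(x) = sum_{n in S} a_n t_n k(x, x_n) + b for new points, as in (7.13)."""
    S = a > tol
    return K_new[:, S] @ (a[S] * np.asarray(t, dtype=float)[S]) + b
```

Here `K_new` denotes the kernel matrix between the new inputs and the training inputs; only the columns corresponding to support vectors contribute, reflecting the sparsity of the solution.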
An alternative, equivalent formulation of the support vector machine, known as the ν-SVM, has been proposed by Schölkopf et al. (2000). This involves maximizing
$$
\widetilde{L}(\mathbf{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \tag{7.38}
$$
subject to the constraints
$$
0 \leq a_n \leq 1/N \tag{7.39}
$$
$$
\sum_{n=1}^{N} a_n t_n = 0 \tag{7.40}
$$
$$
\sum_{n=1}^{N} a_n \geq \nu. \tag{7.41}
$$
This approach has the advantage that the parameter $\nu$, which replaces $C$, can be interpreted as both an upper bound on the fraction of margin errors (points for which $\xi_n > 0$ and hence which lie on the wrong side of the margin boundary and which may or may not be misclassified) and a lower bound on the fraction of support vectors.
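A ν-SVM of this form is available in standard libraries. The following minimal sketch assumes scikit-learn's `NuSVC` with a Gaussian (RBF) kernel on hypothetical two-class data; the values of `nu` and the kernel width `gamma` are illustrative.

```python
import numpy as np
from sklearn.svm import NuSVC  # assumed available

rng = np.random.default_rng(0)
# Hypothetical overlapping (nonseparable) two-class data in two dimensions.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

# nu plays the role described above; gamma = 0.45 matches the kernel width
# quoted for the example of Figure 7.4.
clf = NuSVC(nu=0.25, kernel='rbf', gamma=0.45)
clf.fit(X, t)

frac_sv = len(clf.support_) / len(X)
print("fraction of support vectors:", frac_sv)   # expected to be at least nu
print("predictions:", clf.predict(X[:5]))
```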
An example of the ν-SVM applied to a synthetic data set is shown in Figure 7.4. Here Gaussian kernels of the form $\exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$ have been used, with $\gamma = 0.45$.

Figure 7.4  Illustration of the ν-SVM applied to a nonseparable data set in two dimensions. The support vectors are indicated by circles.
Although predictions for new inputs are made using only the support vectors, the training phase (i.e., the determination of the parameters $\mathbf{a}$ and $b$) makes use of the whole data set, and so it is important to have efficient algorithms for solving the quadratic programming problem. We first note that the objective function $\widetilde{L}(\mathbf{a})$ given by (7.10) or (7.32) is quadratic, and so any local optimum will also be a global optimum provided the constraints define a convex region (which they do as a consequence of being linear). Direct solution of the quadratic programming problem using traditional techniques is often infeasible due to the demanding computation and memory requirements, and so more practical approaches need to be found. The technique of chunking (Vapnik, 1982) exploits the fact that the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero.
This allows the full quadratic programming problem to be broken down into a series of smaller ones, whose goal is eventually to identify all of the nonzero Lagrange multipliers and discard the others. Chunking can be implemented using protected conjugate gradients (Burges, 1998). Although chunking reduces the size of the matrix in the quadratic function from the number of data points squared to approximately the number of nonzero Lagrange multipliers squared, even this may be too big to fit in memory for large-scale applications. Decomposition methods (Osuna et al., 1996) also solve a series of smaller quadratic programming problems but are designed so that each of these is of a fixed size, and so the technique can be applied to arbitrarily large data sets. However, it still involves numerical solution of quadratic programming subproblems and these can be problematic and expensive. One of the most popular approaches to training support vector machines is called sequential minimal optimization, or SMO (Platt, 1999).
It takes the concept of chunking to the extreme limit and considers just two Lagrange multipliers at a time. In this case, the subproblem can be solved analytically, thereby avoiding numerical quadratic programming altogether. Heuristics are given for choosing the pair of Lagrange multipliers to be considered at each step. In practice, SMO is found to have a scaling with the number of data points that is somewhere between linear and quadratic depending on the particular application.
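To make the two-multiplier idea concrete, the following is a minimal sketch of the analytic pair update for the dual (7.32) with box constraints (7.33). It follows a simplified scheme with random pair selection rather than the pair-choice heuristics of Platt (1999), and the sweep count used as a stopping rule is purely illustrative.

```python
import numpy as np

def smo_step(i, j, a, b, t, K, C):
    """Analytically re-optimize the pair (a_i, a_j), holding all other multipliers fixed."""
    if i == j:
        return a, b
    # Prediction errors E_k = y(x_k) - t_k under the current multipliers.
    y = (a * t) @ K + b
    Ei, Ej = y[i] - t[i], y[j] - t[j]
    # Feasible interval [L, H] for a_j implied by 0 <= a_n <= C and sum_n a_n t_n = 0.
    if t[i] != t[j]:
        L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:
        L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the constraint line
    if L >= H or eta <= 0:
        return a, b
    aj_new = np.clip(a[j] + t[j] * (Ei - Ej) / eta, L, H)
    ai_new = a[i] + t[i] * t[j] * (a[j] - aj_new)
    # Re-estimate the bias so the KKT conditions hold for the updated pair.
    b1 = b - Ei - t[i] * (ai_new - a[i]) * K[i, i] - t[j] * (aj_new - a[j]) * K[i, j]
    b2 = b - Ej - t[i] * (ai_new - a[i]) * K[i, j] - t[j] * (aj_new - a[j]) * K[j, j]
    a = a.copy()
    a[i], a[j] = ai_new, aj_new
    if 0 < ai_new < C:
        b = b1
    elif 0 < aj_new < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)
    return a, b

def smo_train(K, t, C=1.0, n_sweeps=50, seed=0):
    """Apply the two-multiplier update to randomly chosen pairs (illustrative stopping rule)."""
    rng = np.random.default_rng(seed)
    N = len(t)
    a, b = np.zeros(N), 0.0
    for _ in range(n_sweeps):
        for i in range(N):
            a, b = smo_step(i, int(rng.integers(N)), a, b, t, K, C)
    return a, b
```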
We have seen that kernel functions correspond to inner products in feature spaces that can have high, or even infinite, dimensionality. By working directly in terms of the kernel function, without introducing the feature space explicitly, it might therefore seem that support vector machines somehow manage to avoid the curse of dimensionality (Section 1.4). This is not the case, however, because there are constraints amongst the feature values that restrict the effective dimensionality of feature space. To see this, consider a simple second-order polynomial kernel that we can expand in terms of its components
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &= \left(1 + \mathbf{x}^{\mathrm{T}}\mathbf{z}\right)^2 = (1 + x_1 z_1 + x_2 z_2)^2 \\
&= 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 \\
&= \left(1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1 x_2, x_2^2\right)\left(1, \sqrt{2}z_1, \sqrt{2}z_2, z_1^2, \sqrt{2}z_1 z_2, z_2^2\right)^{\mathrm{T}} \\
&= \phi(\mathbf{x})^{\mathrm{T}}\phi(\mathbf{z}).
\end{aligned}
\tag{7.42}
$$
This kernel function therefore represents an inner product in a feature space having six dimensions, in which the mapping from input space to feature space is described by the vector function $\phi(\mathbf{x})$.
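The identity (7.42) is easy to verify numerically. The following short sketch, with arbitrary illustrative test vectors, checks that the explicit six-dimensional feature map reproduces the kernel value computed directly in input space.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (1 + x^T z)^2 in two dimensions, as in (7.42)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([0.3, -1.2])   # arbitrary test points
z = np.array([2.0, 0.7])

k_direct  = (1.0 + x @ z) ** 2      # kernel evaluated in input space
k_feature = phi(x) @ phi(z)         # inner product in the 6-D feature space
print(np.isclose(k_direct, k_feature))   # True
```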
However, the coefficients weighting these different features are constrained to have specific forms. Thus any set of points in the original two-dimensional space $\mathbf{x}$ would be constrained to lie exactly on a two-dimensional nonlinear manifold embedded in the six-dimensional feature space.

We have already highlighted the fact that the support vector machine does not provide probabilistic outputs but instead makes classification decisions for new input vectors. Veropoulos et al. (1999) discuss modifications to the SVM to allow the trade-off between false positive and false negative errors to be controlled.
However, if we wish to use the SVM as a module in a larger probabilistic system, then probabilistic predictions of the class label $t$ for new inputs $\mathbf{x}$ are required. To address this issue, Platt (2000) has proposed fitting a logistic sigmoid to the outputs of a previously trained support vector machine. Specifically, the required conditional probability is assumed to be of the form
$$
p(t = 1 \mid \mathbf{x}) = \sigma\left(A\, y(\mathbf{x}) + B\right) \tag{7.43}
$$
where $y(\mathbf{x})$ is defined by (7.1). Values for the parameters $A$ and $B$ are found by minimizing the cross-entropy error function defined by a training set consisting of pairs of values $y(\mathbf{x}_n)$ and $t_n$. The data used to fit the sigmoid needs to be independent of that used to train the original SVM in order to avoid severe over-fitting.
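A minimal sketch of this fitting step is given below, using plain gradient descent on the cross-entropy error with SVM scores $y(\mathbf{x}_n)$ and labels $t_n$ from a held-out set. It is not Platt's exact algorithm (which also uses regularized target values and a more careful optimizer); the toy data, learning rate, and iteration count are hypothetical.

```python
import numpy as np

def fit_platt_sigmoid(y_scores, t, n_iter=2000, lr=1e-3):
    """Fit p(t=1|x) = sigmoid(A * y(x) + B), eq. (7.43), by minimizing cross-entropy.

    y_scores : SVM outputs y(x_n) on data held out from SVM training
    t        : corresponding labels in {-1, +1}
    """
    targets = (t + 1) / 2.0          # map {-1, +1} -> {0, 1}
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A * y_scores + B)))   # sigma(A y + B)
        grad = p - targets            # derivative of cross-entropy w.r.t. (A y + B)
        A -= lr * np.sum(grad * y_scores)
        B -= lr * np.sum(grad)
    return A, B

def predict_proba(y_scores, A, B):
    """Probabilistic outputs (7.43) for new SVM scores."""
    return 1.0 / (1.0 + np.exp(-(A * y_scores + B)))

# Hypothetical held-out scores and labels:
scores = np.array([-2.1, -0.3, 0.2, 1.5, 2.8])
labels = np.array([-1, -1, 1, 1, 1])
A, B = fit_platt_sigmoid(scores, labels)
print(predict_proba(np.array([0.0, 1.0]), A, B))
```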