This two-stage approach is equivalent to assuming that the output y(x) of the support vector machine represents the log-odds of x belonging to class t = 1. Because the SVM training procedure is not specifically intended to encourage this, the SVM can give a poor approximation to the posterior probabilities (Tipping, 2001).

7.1.2 Relation to logistic regression

As with the separable case, we can re-cast the SVM for nonseparable distributions in terms of the minimization of a regularized error function.
This will also allow us to highlight similarities, and differences, compared to the logistic regression model of Section 4.3.2.

We have seen that for data points that are on the correct side of the margin boundary, and which therefore satisfy $y_n t_n \geqslant 1$, we have $\xi_n = 0$, and for the remaining points we have $\xi_n = 1 - y_n t_n$. Thus the objective function (7.21) can be written (up to an overall multiplicative constant) in the form

$$\sum_{n=1}^{N} E_{SV}(y_n t_n) + \lambda \|\mathbf{w}\|^2 \qquad (7.44)$$

where $\lambda = (2C)^{-1}$, and $E_{SV}(\cdot)$ is the hinge error function defined by

$$E_{SV}(y_n t_n) = [1 - y_n t_n]_{+} \qquad (7.45)$$

where $[\,\cdot\,]_{+}$ denotes the positive part.

[Figure 7.5: Plot of the 'hinge' error function used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.]
The hinge error function, so-called because of its shape, is plotted in Figure 7.5. It can be viewed as an approximation to the misclassification error, i.e., the error function that ideally we would like to minimize, which is also shown in Figure 7.5.

When we considered the logistic regression model in Section 4.3.2, we found it convenient to work with target variable t ∈ {0, 1}. For comparison with the support vector machine, we first reformulate maximum likelihood logistic regression using the target variable t ∈ {−1, 1}. To do this, we note that p(t = 1|y) = σ(y) where y(x) is given by (7.1), and σ(y) is the logistic sigmoid function defined by (4.59). It follows that p(t = −1|y) = 1 − σ(y) = σ(−y), where we have used the properties of the logistic sigmoid function, and so we can write

$$p(t|y) = \sigma(yt). \qquad (7.46)$$

From this we can construct an error function (Exercise 7.6) by taking the negative logarithm of the likelihood function that, with a quadratic regularizer, takes the form

$$\sum_{n=1}^{N} E_{LR}(y_n t_n) + \lambda \|\mathbf{w}\|^2 \qquad (7.47)$$

where

$$E_{LR}(yt) = \ln\left(1 + \exp(-yt)\right). \qquad (7.48)$$
For comparison with other error functions, we can divide by ln(2) so that the error function passes through the point (0, 1). This rescaled error function is also plotted in Figure 7.5 and we see that it has a similar form to the support vector error function. The key difference is that the flat region in $E_{SV}(yt)$ leads to sparse solutions.
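To make the comparison in Figure 7.5 concrete, the following short sketch (ours, in Python with NumPy; the book itself contains no code) evaluates the hinge error (7.45), the logistic error (7.48) rescaled by 1/ln 2, the misclassification error, and the squared error as functions of z = yt.

```python
import numpy as np

def hinge_error(z):
    """E_SV(z) = [1 - z]_+ , the hinge error of equation (7.45)."""
    return np.maximum(0.0, 1.0 - z)

def logistic_error_rescaled(z):
    """E_LR(z) = ln(1 + exp(-z)) of (7.48), divided by ln 2 so it passes through (0, 1)."""
    return np.log1p(np.exp(-z)) / np.log(2.0)

def misclassification_error(z):
    """Ideal 0-1 error: 1 if z = y*t is negative (wrong side of the boundary), else 0."""
    return (z < 0).astype(float)

def squared_error(z):
    """Squared error (y - t)^2 written as a function of z = y*t, i.e. (z - 1)^2 for t in {-1, +1}."""
    return (z - 1.0) ** 2

z = np.linspace(-2.0, 2.0, 9)
for name, f in [("hinge", hinge_error),
                ("logistic/ln2", logistic_error_rescaled),
                ("0-1", misclassification_error),
                ("squared", squared_error)]:
    print(name, np.round(f(z), 3))
```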
Both the logistic error and the hinge loss can be viewed as continuous approximations to the misclassification error. Another continuous error function that has sometimes been used to solve classification problems is the squared error, which is again plotted in Figure 7.5. It has the property, however, of placing increasing emphasis on data points that are correctly classified but that are a long way from the decision boundary on the correct side. Such points will be strongly weighted at the expense of misclassified points, and so if the objective is to minimize the misclassification rate, then a monotonically decreasing error function would be a better choice.

7.1.3 Multiclass SVMs

The support vector machine is fundamentally a two-class classifier. In practice, however, we often have to tackle problems involving K > 2 classes. Various methods have therefore been proposed for combining multiple two-class SVMs in order to build a multiclass classifier.

One commonly used approach (Vapnik, 1998) is to construct K separate SVMs, in which the $k$th model $y_k(\mathbf{x})$ is trained using the data from class $\mathcal{C}_k$ as the positive examples and the data from the remaining K − 1 classes as the negative examples. This is known as the one-versus-the-rest approach. However, in Figure 4.2 we saw that using the decisions of the individual classifiers can lead to inconsistent results in which an input is assigned to multiple classes simultaneously.
This problem is sometimes addressed by making predictions for new inputs $\mathbf{x}$ using

$$y(\mathbf{x}) = \max_{k} y_k(\mathbf{x}). \qquad (7.49)$$

Unfortunately, this heuristic approach suffers from the problem that the different classifiers were trained on different tasks, and there is no guarantee that the real-valued quantities $y_k(\mathbf{x})$ for different classifiers will have appropriate scales.
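As an illustration of the one-versus-the-rest construction and the heuristic (7.49), here is a minimal sketch using scikit-learn's SVC; the library choice, the helper names, and the toy data are ours rather than anything prescribed in the text.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, t, n_classes, C=1.0):
    """Train K binary SVMs; classifier k treats class k as positive, the rest as negative."""
    models = []
    for k in range(n_classes):
        binary_targets = np.where(t == k, 1, -1)
        models.append(SVC(kernel='rbf', C=C).fit(X, binary_targets))
    return models

def predict_one_vs_rest(models, X):
    """Heuristic (7.49): assign x to the class whose classifier gives the largest y_k(x).
    The decision values of the K classifiers are not guaranteed to be on comparable
    scales, which is exactly the weakness discussed in the text."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)

# toy usage with random data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
t = rng.integers(0, 3, size=60)
models = train_one_vs_rest(X, t, n_classes=3)
print(predict_one_vs_rest(models, X[:5]))
```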
Another problem with the one-versus-the-rest approach is that the training sets are imbalanced. For instance, if we have ten classes each with equal numbers of training data points, then the individual classifiers are trained on data sets comprising 90% negative examples and only 10% positive examples, and the symmetry of the original problem is lost. A variant of the one-versus-the-rest scheme was proposed by Lee et al. (2001) who modify the target values so that the positive class has target +1 and the negative class has target −1/(K − 1).

Weston and Watkins (1999) define a single objective function for training all K SVMs simultaneously, based on maximizing the margin from each class to the remaining classes. However, this can result in much slower training because, instead of solving K separate optimization problems each over N data points with an overall cost of $O(KN^2)$, a single optimization problem of size (K − 1)N must be solved, giving an overall cost of $O(K^2 N^2)$.
Another approach is to train K(K − 1)/2 different two-class SVMs on all possible pairs of classes, and then to classify test points according to which class has the highest number of 'votes', an approach that is sometimes called one-versus-one. Again, we saw in Figure 4.2 that this can lead to ambiguities in the resulting classification. Also, for large K this approach requires significantly more training time than the one-versus-the-rest approach. Similarly, to evaluate test points, significantly more computation is required.
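A corresponding sketch of the one-versus-one voting scheme is given below; again the helper names and the use of scikit-learn's SVC are our own illustrative choices.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, t, n_classes, C=1.0):
    """Train K(K-1)/2 binary SVMs, one for each unordered pair of classes (i, j)."""
    models = {}
    for i, j in combinations(range(n_classes), 2):
        mask = (t == i) | (t == j)
        # label class j as +1 and class i as -1 for this pairwise problem
        models[(i, j)] = SVC(kernel='rbf', C=C).fit(X[mask], np.where(t[mask] == j, 1, -1))
    return models

def predict_one_vs_one(models, X, n_classes):
    """Each pairwise classifier casts one vote per test point; the class with most votes wins."""
    votes = np.zeros((X.shape[0], n_classes), dtype=int)
    for (i, j), m in models.items():
        pred = m.predict(X)          # +1 means class j, -1 means class i
        votes[pred == 1, j] += 1
        votes[pred == -1, i] += 1
    return np.argmax(votes, axis=1)  # ties are broken arbitrarily (lowest index wins)
```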
The latter problem can be alleviated by organizing the pairwise classifiers into a directed acyclic graph (not to be confused with a probabilistic graphical model), leading to the DAGSVM (Platt et al., 2000). For K classes, the DAGSVM has a total of K(K − 1)/2 classifiers, and to classify a new test point only K − 1 pairwise classifiers need to be evaluated, with the particular classifiers used depending on which path through the graph is traversed.

A different approach to multiclass classification, based on error-correcting output codes, was developed by Dietterich and Bakiri (1995) and applied to support vector machines by Allwein et al. (2000).
This can be viewed as a generalization of the voting scheme of the one-versus-one approach in which more general partitions of the classes are used to train the individual classifiers. The K classes themselves are represented as particular sets of responses from the two-class classifiers chosen, and together with a suitable decoding scheme, this gives robustness to errors and to ambiguity in the outputs of the individual classifiers.
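The decoding step can be illustrated with a small sketch that assigns a test point to the class whose code word is nearest, in Hamming distance, to the observed vector of two-class responses; the particular code matrix below is an arbitrary example of ours, not one taken from Dietterich and Bakiri (1995).

```python
import numpy as np

# Example code matrix for K = 4 classes and 6 binary classifiers:
# row k is the pattern of +1/-1 responses that ideally identifies class k.
CODE = np.array([[+1, +1, +1, -1, -1, -1],
                 [+1, -1, -1, +1, +1, -1],
                 [-1, +1, -1, +1, -1, +1],
                 [-1, -1, +1, -1, +1, +1]])

def ecoc_decode(responses, code=CODE):
    """Assign each input to the class whose code word is closest in Hamming distance
    to the observed vector of binary classifier outputs (responses has shape
    (n_samples, n_classifiers) with entries in {-1, +1})."""
    distances = np.array([[(r != c).sum() for c in code] for r in responses])
    return np.argmin(distances, axis=1)

# A response with one flipped bit is still decoded to the intended class:
print(ecoc_decode(np.array([[+1, +1, +1, -1, -1, +1]])))  # -> [0]
```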
Although the application of SVMs to multiclass classification problems remains an open issue, in practice the one-versus-the-rest approach is the most widely used in spite of its ad hoc formulation and its practical limitations.

There are also single-class support vector machines, which solve an unsupervised learning problem related to probability density estimation. Instead of modelling the density of data, however, these methods aim to find a smooth boundary enclosing a region of high density. The boundary is chosen to represent a quantile of the density, that is, the probability that a data point drawn from the distribution will land inside that region is given by a fixed number between 0 and 1 that is specified in advance. This is a more restricted problem than estimating the full density but may be sufficient in specific applications. Two approaches to this problem using support vector machines have been proposed. The algorithm of Schölkopf et al. (2001) tries to find a hyperplane that separates all but a fixed fraction ν of the training data from the origin while at the same time maximizing the distance (margin) of the hyperplane from the origin, while Tax and Duin (1999) look for the smallest sphere in feature space that contains all but a fraction ν of the data points. For kernels $k(\mathbf{x}, \mathbf{x}')$ that are functions only of $\mathbf{x} - \mathbf{x}'$, the two algorithms are equivalent.
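As a practical illustration of the first of these formulations, scikit-learn's OneClassSVM exposes a nu parameter that plays the role of the fraction ν above; the data and parameter settings in this sketch are purely illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # samples from the density of interest

# nu upper-bounds the fraction of training points allowed to fall outside the region
model = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_train)

X_test = np.array([[0.0, 0.0],                 # near the bulk of the density
                   [5.0, 5.0]])                # far outside it
print(model.predict(X_test))                   # +1 = inside the region, -1 = outside
print(model.decision_function(X_test))         # signed distance from the boundary
```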
7.1.4 SVMs for regression

We now extend support vector machines to regression problems while at the same time preserving the property of sparseness. In simple linear regression (Section 3.1.4), we minimize a regularized error function given by

$$\frac{1}{2}\sum_{n=1}^{N} \{y_n - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2. \qquad (7.50)$$

To obtain sparse solutions, the quadratic error function is replaced by an ε-insensitive error function (Vapnik, 1995), which gives zero error if the absolute difference between the prediction y(x) and the target t is less than ε, where ε > 0.

[Figure 7.6: Plot of an ε-insensitive error function (in red) in which the error increases linearly with distance beyond the insensitive region. Also shown for comparison is the quadratic error function (in green).]
A simple example of an ε-insensitive error function, having a linear cost associated with errors outside the insensitive region, is given by

$$E_\epsilon(y(\mathbf{x}) - t) = \begin{cases} 0, & \text{if } |y(\mathbf{x}) - t| < \epsilon; \\ |y(\mathbf{x}) - t| - \epsilon, & \text{otherwise} \end{cases} \qquad (7.51)$$

and is illustrated in Figure 7.6.

We therefore minimize a regularized error function given by

$$C \sum_{n=1}^{N} E_\epsilon(y(\mathbf{x}_n) - t_n) + \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7.52)$$

where y(x) is given by (7.1). By convention the (inverse) regularization parameter, denoted C, appears in front of the error term.
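The definitions (7.51) and (7.52) translate directly into code; the following NumPy sketch (with function names of our own choosing) makes the tube structure explicit.

```python
import numpy as np

def eps_insensitive_error(residual, eps):
    """E_epsilon of (7.51): zero inside the tube of half-width eps, linear outside it."""
    return np.maximum(0.0, np.abs(residual) - eps)

def svr_objective(w, y_pred, t, C, eps):
    """Regularized error (7.52): C * sum of eps-insensitive errors + ||w||^2 / 2."""
    return C * eps_insensitive_error(y_pred - t, eps).sum() + 0.5 * np.dot(w, w)

# small check: residuals inside the tube contribute nothing to the error term
print(eps_insensitive_error(np.array([-0.3, 0.05, 0.8]), eps=0.1))  # approx. [0.2, 0.0, 0.7]
```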
As before, we can re-express the optimization problem by introducing slack variables. For each data point $\mathbf{x}_n$, we now need two slack variables $\xi_n \geqslant 0$ and $\widehat{\xi}_n \geqslant 0$, where $\xi_n > 0$ corresponds to a point for which $t_n > y(\mathbf{x}_n) + \epsilon$, and $\widehat{\xi}_n > 0$ corresponds to a point for which $t_n < y(\mathbf{x}_n) - \epsilon$, as illustrated in Figure 7.7.

The condition for a target point to lie inside the ε-tube is that $y_n - \epsilon \leqslant t_n \leqslant y_n + \epsilon$, where $y_n = y(\mathbf{x}_n)$. Introducing the slack variables allows points to lie outside the tube provided the slack variables are nonzero, and the corresponding conditions are

$$t_n \leqslant y(\mathbf{x}_n) + \epsilon + \xi_n \qquad (7.53)$$

$$t_n \geqslant y(\mathbf{x}_n) - \epsilon - \widehat{\xi}_n. \qquad (7.54)$$
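For completeness, an ε-insensitive support vector regression of this kind can be fitted with scikit-learn's SVR, whose C and epsilon arguments correspond to the parameters C and ε above; the synthetic data and settings are illustrative only.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, size=(100, 1)), axis=0)
t = np.sin(X).ravel() + 0.1 * rng.normal(size=100)      # noisy targets

# C plays the role of the (inverse) regularization parameter in (7.52),
# epsilon is the half-width of the insensitive tube in (7.51)
model = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, t)

print("number of support vectors:", len(model.support_))
print("predictions at a few points:", model.predict(np.array([[-2.0], [0.0], [2.0]])))
```

Only the data points lying on or outside the ε-tube become support vectors, which is the source of the sparseness referred to above.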