Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), страница 75

Файл №811375 Bishop C.M. Pattern Recognition and Machine Learning (2006) (Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf) 75 страницаBishop C.M. Pattern Recognition and Machine Learning (2006) (811375) страница 752020-08-252020-08-25СтудИзба

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 75)

. . , xN and a nonlinear feature mapping φ(x).Suppose that the dependence of the error function on w takes the formJ(w) = f (wT φ(x1 ), . . . , wT φ(xN )) + g(wT w)(6.97)where g(·) is a monotonically increasing function. By writing w in the formw=Nαn φ(xn ) + w⊥(6.98)n=1show that the value of w that minimizes J(w) takes the form of a linear combinationof the basis functions φ(xn ) for n = 1, . . . , N .6.17 ( ) www Consider the sum-of-squares error function (6.39) for data havingnoisy inputs, where ν(ξ) is the distribution of the noise. Use the calculus of variations to minimize this error function with respect to the function y(x), and henceshow that the optimal solution is given by an expansion of the form (6.40) in whichthe basis functions are given by (6.41).3226. KERNEL METHODS6.18 () Consider a Nadaraya-Watson model with one input variable x and one targetvariable t having Gaussian components with isotropic covariances, so that the covariance matrix is given by σ 2 I where I is the unit matrix.

Write down expressionsfor the conditional density p(t|x) and for the conditional mean E[t|x] and variancevar[t|x], in terms of the kernel function k(x, xn ).6.19 ( ) Another viewpoint on kernel regression comes from a consideration of regression problems in which the input variables as well as the target variables arecorrupted with additive noise. Suppose each target value tn is generated as usualby taking a function y(zn ) evaluated at a point zn , and adding Gaussian noise. Thevalue of zn is not directly observed, however, but only a noise corrupted versionxn = zn + ξ n where the random variable ξ is governed by some distribution g(ξ).Consider a set of observations {xn , tn }, where n = 1, . . .

, N , together with a corresponding sum-of-squares error function deﬁned by averaging over the distributionof input noise to give1E=2N2{y(xn − ξ n ) − tn } g(ξ n ) dξn .(6.99)n=1By minimizing E with respect to the function y(z) using the calculus of variations(Appendix D), show that optimal solution for y(x) is given by a Nadaraya-Watsonkernel regression solution of the form (6.45) with a kernel of the form (6.46).6.20 ( ) wwwVerify the results (6.66) and (6.67).6.21 ( ) www Consider a Gaussian process regression model in which the kernelfunction is deﬁned in terms of a ﬁxed set of nonlinear basis functions. Show that thepredictive distribution is identical to the result (3.58) obtained in Section 3.3.2 for theBayesian linear regression model. To do this, note that both models have Gaussianpredictive distributions, and so it is only necessary to show that the conditional meanand variance are the same.

For the mean, make use of the matrix identity (C.6), andfor the variance, make use of the matrix identity (C.7).6.22 ( ) Consider a regression problem with N training set input vectors x1 , . . . , xNand L test set input vectors xN +1 , . . . , xN +L , and suppose we deﬁne a Gaussianprocess prior over functions t(x). Derive an expression for the joint predictive distribution for t(xN +1 ), . . . , t(xN +L ), given the values of t(x1 ), . . . , t(xN ). Show themarginal of this distribution for one of the test observations tj where N + 1 j N + L is given by the usual Gaussian process regression result (6.66) and (6.67).6.23 ( ) www Consider a Gaussian process regression model in which the targetvariable t has dimensionality D.

Write down the conditional distribution of tN +1for a test input vector xN +1 , given a training set of input vectors x1 , . . . , xN +1 andcorresponding target observations t1 , . . . , tN .6.24 () Show that a diagonal matrix W whose elements satisfy 0 < Wii < 1 is positivedeﬁnite. Show that the sum of two positive deﬁnite matrices is itself positive deﬁnite.Exercises3236.25 () www Using the Newton-Raphson formula (4.92), derive the iterative updateformula (6.83) for ﬁnding the mode aN of the posterior distribution in the Gaussianprocess classiﬁcation model.6.26 () Using the result (2.115), derive the expressions (6.87) and (6.88) for the meanand variance of the posterior distribution p(aN +1 |tN ) in the Gaussian process classiﬁcation model.6.27 ( ) Derive the result (6.90) for the log likelihood function in the Laplace approximation framework for Gaussian process classiﬁcation.

Similarly, derive the results(6.91), (6.92), and (6.94) for the terms in the gradient of the log likelihood.7Sparse KernelMachinesIn the previous chapter, we explored a variety of learning algorithms based on nonlinear kernels. One of the signiﬁcant limitations of many such algorithms is thatthe kernel function k(xn , xm ) must be evaluated for all possible pairs xn and xmof training points, which can be computationally infeasible during training and canlead to excessive computation times when making predictions for new data points.In this chapter we shall look at kernel-based algorithms that have sparse solutions,so that predictions for new inputs depend only on the kernel function evaluated at asubset of the training data points.We begin by looking in some detail at the support vector machine (SVM), whichbecame popular in some years ago for solving problems in classiﬁcation, regression,and novelty detection.

An important property of support vector machines is that thedetermination of the model parameters corresponds to a convex optimization problem, and so any local solution is also a global optimum. Because the discussion ofsupport vector machines makes extensive use of Lagrange multipliers, the reader is3253267.

SPARSE KERNEL MACHINESSection 7.2encouraged to review the key concepts covered in Appendix E. Additional information on support vector machines can be found in Vapnik (1995), Burges (1998),Cristianini and Shawe-Taylor (2000), Müller et al. (2001), Schölkopf and Smola(2002), and Herbrich (2002).The SVM is a decision machine and so does not provide posterior probabilities.We have already discussed some of the beneﬁts of determining probabilities in Section 1.5.4.

An alternative sparse kernel technique, known as the relevance vectormachine (RVM), is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having typically much sparser solutions than the SVM.7.1. Maximum Margin ClassiﬁersWe begin our discussion of support vector machines by returning to the two-classclassiﬁcation problem using linear models of the formy(x) = wT φ(x) + bSection 7.1.5(7.1)where φ(x) denotes a ﬁxed feature-space transformation, and we have made thebias parameter b explicit. Note that we shall shortly introduce a dual representationexpressed in terms of kernel functions, which avoids having to work explicitly infeature space. The training data set comprises N input vectors x1 , . .

. , xN , withcorresponding target values t1 , . . . , tN where tn ∈ {−1, 1}, and new data points xare classiﬁed according to the sign of y(x).We shall assume for the moment that the training data set is linearly separable infeature space, so that by deﬁnition there exists at least one choice of the parametersw and b such that a function of the form (7.1) satisﬁes y(xn ) > 0 for points havingtn = +1 and y(xn ) < 0 for points having tn = −1, so that tn y(xn ) > 0 for alltraining data points.There may of course exist many such solutions that separate the classes exactly.In Section 4.1.7, we described the perceptron algorithm that is guaranteed to ﬁnda solution in a ﬁnite number of steps.

The solution that it ﬁnds, however, will bedependent on the (arbitrary) initial values chosen for w and b as well as on theorder in which the data points are presented. If there are multiple solutions all ofwhich classify the training data set exactly, then we should try to ﬁnd the one thatwill give the smallest generalization error. The support vector machine approachesthis problem through the concept of the margin, which is deﬁned to be the smallestdistance between the decision boundary and any of the samples, as illustrated inFigure 7.1.In support vector machines the decision boundary is chosen to be the one forwhich the margin is maximized.

The maximum margin solution can be motivated using computational learning theory, also known as statistical learning theory. However, a simple insight into the origins of maximum margin has been given by Tongand Koller (2000) who consider a framework for classiﬁcation based on a hybrid ofgenerative and discriminative approaches. They ﬁrst model the distribution over input vectors x for each class using a Parzen density estimator with Gaussian kernels7.1.

Maximum Margin Classiﬁers327y = −1y=1y=0y = −1y=0y=1marginFigure 7.1 The margin is deﬁned as the perpendicular distance between the decision boundary and the closestof the data points, as shown on the left ﬁgure. Maximizing the margin leads to a particular choice of decisionboundary, as shown on the right.

The location of this boundary is determined by a subset of the data points,known as support vectors, which are indicated by the circles.having a common parameter σ 2 . Together with the class priors, this deﬁnes an optimal misclassiﬁcation-rate decision boundary. However, instead of using this optimalboundary, they determine the best hyperplane by minimizing the probability of errorrelative to the learned density model.

In the limit σ 2 → 0, the optimal hyperplaneis shown to be the one having maximum margin. The intuition behind this result isthat as σ 2 is reduced, the hyperplane is increasingly dominated by nearby data pointsrelative to more distant ones. In the limit, the hyperplane becomes independent ofdata points that are not support vectors.We shall see in Figure 10.13 that marginalization with respect to the prior distribution of the parameters in a Bayesian approach for a simple linearly separable dataset leads to a decision boundary that lies in the middle of the region separating thedata points.

The large margin solution has similar behaviour.Recall from Figure 4.1 that the perpendicular distance of a point x from a hyperplane deﬁned by y(x) = 0 where y(x) takes the form (7.1) is given by |y(x)|/w.Furthermore, we are only interested in solutions for which all data points are correctly classiﬁed, so that tn y(xn ) > 0 for all n. Thus the distance of a point xn to thedecision surface is given bytn y(xn )tn (wT φ(xn ) + b)=.ww(7.2)The margin is given by the perpendicular distance to the closest point xn from thedata set, and we wish to optimize the parameters w and b in order to maximize thisdistance. Thus the maximum margin solution is found by solving 1arg max(7.3)min tn wT φ(xn ) + bw nw,bwhere we have taken the factor 1/w outside the optimization over n because w3287.

SPARSE KERNEL MACHINESdoes not depend on n. Direct solution of this optimization problem would be verycomplex, and so we shall convert it into an equivalent problem that is much easierto solve. To do this we note that if we make the rescaling w → κw and b → κb,then the distance from any point xn to the decision surface, given by tn y(xn )/w,is unchanged.

Характеристики

Тип файла

PDF-файл

Размер

9,37 Mb

Материал

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Тип материала

Книга

Предмет

(ММО) Методы машинного обучения

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

bishop-c.m.-pattern-recognition-and-machine-learning-2006.pdf.rar

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.