Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 80


This still considers any possible choice for $p(\mathbf{x}, t)$, and so although the bounds are tighter, they are still very conservative.

7.2. Relevance Vector Machines

Support vector machines have been used in a variety of classification and regression applications. Nevertheless, they suffer from a number of limitations, several of which have been highlighted already in this chapter. In particular, the outputs of an SVM represent decisions rather than posterior probabilities.

Also, the SVM was originally formulated for two classes, and the extension to $K > 2$ classes is problematic. There is a complexity parameter $C$, or $\nu$ (as well as a parameter $\epsilon$ in the case of regression), that must be found using a hold-out method such as cross-validation. Finally, predictions are expressed as linear combinations of kernel functions that are centred on training data points and that are required to be positive definite.

The relevance vector machine or RVM (Tipping, 2001) is a Bayesian sparse kernel technique for regression and classification that shares many of the characteristics of the SVM whilst avoiding its principal limitations.

Additionally, it typically leads to much sparser models, resulting in correspondingly faster performance on test data whilst maintaining comparable generalization error.

In contrast to the SVM we shall find it more convenient to introduce the regression form of the RVM first and then consider the extension to classification tasks.

7.2.1 RVM for regression

The relevance vector machine for regression is a linear model of the form studied in Chapter 3 but with a modified prior that results in sparse solutions.

The model defines a conditional distribution for a real-valued target variable $t$, given an input vector $\mathbf{x}$, which takes the form

$$p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t|y(\mathbf{x}), \beta^{-1}) \tag{7.76}$$

where $\beta = \sigma^{-2}$ is the noise precision (inverse noise variance), and the mean is given by a linear model of the form

$$y(\mathbf{x}) = \sum_{i=1}^{M} w_i \phi_i(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{7.77}$$

with fixed nonlinear basis functions $\phi_i(\mathbf{x})$, which will typically include a constant term so that the corresponding weight parameter represents a 'bias'.

The relevance vector machine is a specific instance of this model, which is intended to mirror the structure of the support vector machine.

In particular, the basis functions are given by kernels, with one kernel associated with each of the data points from the training set. The general expression (7.77) then takes the SVM-like form

$$y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x}, \mathbf{x}_n) + b \tag{7.78}$$

where $b$ is a bias parameter. The number of parameters in this case is $M = N + 1$, and $y(\mathbf{x})$ has the same form as the predictive model (7.64) for the SVM, except that the coefficients $a_n$ are here denoted $w_n$. It should be emphasized that the subsequent analysis is valid for arbitrary choices of basis function, and for generality we shall work with the form (7.77). In contrast to the SVM, there is no restriction to positive definite kernels, nor are the basis functions tied in either number or location to the training data points.

Suppose we are given a set of $N$ observations of the input vector $\mathbf{x}$, which we denote collectively by a data matrix $\mathbf{X}$ whose $n$th row is $\mathbf{x}_n^{\mathrm{T}}$ with $n = 1, \ldots, N$. The corresponding target values are given by $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. Thus, the likelihood function is given by

$$p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta). \tag{7.79}$$

Next we introduce a prior distribution over the parameter vector $\mathbf{w}$ and, as in Chapter 3, we shall consider a zero-mean Gaussian prior. However, the key difference in the RVM is that we introduce a separate hyperparameter $\alpha_i$ for each of the weight parameters $w_i$ instead of a single shared hyperparameter. Thus the weight prior takes the form

$$p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(w_i|0, \alpha_i^{-1}) \tag{7.80}$$

where $\alpha_i$ represents the precision of the corresponding parameter $w_i$, and $\boldsymbol{\alpha}$ denotes $(\alpha_1, \ldots, \alpha_M)^{\mathrm{T}}$. We shall see that, when we maximize the evidence with respect to these hyperparameters, a significant proportion of them go to infinity, and the corresponding weight parameters have posterior distributions that are concentrated at zero. The basis functions associated with these parameters therefore play no role in the predictions made by the model and so are effectively pruned out, resulting in a sparse model.

Using the result (3.49) for linear regression models, we see that the posterior distribution for the weights is again Gaussian and takes the form

$$p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \boldsymbol{\Sigma}) \tag{7.81}$$

where the mean and covariance are given by

$$\mathbf{m} = \beta \boldsymbol{\Sigma} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \tag{7.82}$$

$$\boldsymbol{\Sigma} = \left(\mathbf{A} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\right)^{-1} \tag{7.83}$$

where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{ni} = \phi_i(\mathbf{x}_n)$, and $\mathbf{A} = \mathrm{diag}(\alpha_i)$.

[Margin references: Section 3.5, Exercise 7.10]
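As a minimal sketch of (7.82) and (7.83) for the SVM-like model (7.78), the following NumPy code builds a design matrix with a bias column plus one Gaussian kernel per training point and then evaluates the posterior mean and covariance. The function names, the Gaussian kernel and its width, the synthetic sinusoidal data, and the fixed values of the hyperparameters are assumptions made for illustration and are not taken from the text.

```python
import numpy as np

def gaussian_kernel_design(X, centres, width=1.0):
    """Design matrix for (7.78): a bias column plus one Gaussian kernel per centre.

    The Gaussian kernel and its width are illustrative assumptions.
    """
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dist / (2.0 * width ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), K])   # shape N x (N + 1)

def rvm_posterior(Phi, t, alpha, beta):
    """Posterior mean m (7.82) and covariance Sigma (7.83) of the weights."""
    A = np.diag(alpha)                                # A = diag(alpha_i)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma

# Synthetic data (assumed): noisy sinusoid on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(10, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(10)

Phi = gaussian_kernel_design(X, X, width=0.2)
alpha = np.ones(Phi.shape[1])   # one precision per weight, fixed here for illustration
beta = 100.0                    # assumed noise precision
m, Sigma = rvm_posterior(Phi, t, alpha, beta)
```

With hyperparameters held fixed like this, the result is ordinary Bayesian linear regression; the sparsity described in the text only appears once each $\alpha_i$ is re-estimated from the data.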

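Note that in the specific case of the model (7.78), we have $\boldsymbol{\Phi} = \mathbf{K}$, where $\mathbf{K}$ is the symmetric $(N + 1) \times (N + 1)$ kernel matrix with elements $k(\mathbf{x}_n, \mathbf{x}_m)$.

The values of $\boldsymbol{\alpha}$ and $\beta$ are determined using type-2 maximum likelihood, also known as the evidence approximation, in which we maximize the marginal likelihood function obtained by integrating out the weight parameters

$$p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) = \int p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w}|\boldsymbol{\alpha})\, \mathrm{d}\mathbf{w}. \tag{7.84}$$

Because this represents the convolution of two Gaussians, it is readily evaluated to give the log marginal likelihood in the form

$$\ln p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) = \ln \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C}) = -\frac{1}{2}\left\{ N \ln(2\pi) + \ln|\mathbf{C}| + \mathbf{t}^{\mathrm{T}} \mathbf{C}^{-1} \mathbf{t} \right\} \tag{7.85}$$

where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$, and we have defined the $N \times N$ matrix $\mathbf{C}$ given by

$$\mathbf{C} = \beta^{-1} \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}}. \tag{7.86}$$

[Margin references: Exercise 7.12, Section 3.5.3]

Our goal is now to maximize (7.85) with respect to the hyperparameters $\boldsymbol{\alpha}$ and $\beta$. This requires only a small modification to the results obtained in Section 3.5 for the evidence approximation in the linear regression model. Again, we can identify two approaches. In the first, we simply set the required derivatives of the marginal likelihood to zero and obtain the following re-estimation equations

$$\alpha_i^{\text{new}} = \frac{\gamma_i}{m_i^2} \tag{7.87}$$

$$(\beta^{\text{new}})^{-1} = \frac{\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}\|^2}{N - \sum_i \gamma_i} \tag{7.88}$$

where $m_i$ is the $i$th component of the posterior mean $\mathbf{m}$ defined by (7.82).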

The quantity $\gamma_i$ measures how well the corresponding parameter $w_i$ is determined by the data and is defined by

$$\gamma_i = 1 - \alpha_i \Sigma_{ii} \tag{7.89}$$

in which $\Sigma_{ii}$ is the $i$th diagonal component of the posterior covariance $\boldsymbol{\Sigma}$ given by (7.83). Learning therefore proceeds by choosing initial values for $\boldsymbol{\alpha}$ and $\beta$, evaluating the mean and covariance of the posterior using (7.82) and (7.83), respectively, and then alternately re-estimating the hyperparameters, using (7.87) and (7.88), and re-estimating the posterior mean and covariance, using (7.82) and (7.83), until a suitable convergence criterion is satisfied.

The second approach is to use the EM algorithm, and is discussed in Section 9.3.4.
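The first approach, alternating the posterior update with the re-estimation equations, can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `fit_rvm`, the initial values, the fixed iteration count, the cap `alpha_max` above which a basis function is dropped, the small numerical guards, and the synthetic data with Gaussian kernels of width 0.2 are all hypothetical choices.

```python
import numpy as np

def fit_rvm(Phi, t, alpha0=1.0, beta0=10.0, n_iter=1000, alpha_max=1e6):
    """Evidence maximization via (7.82), (7.83) and (7.87)-(7.89).

    Basis functions whose alpha exceeds alpha_max are pruned. The cap, the
    initial values and the fixed iteration count are illustrative assumptions.
    """
    N, M = Phi.shape
    active = np.arange(M)              # indices of basis functions still in the model
    alpha = np.full(M, alpha0)
    beta = beta0
    for _ in range(n_iter):
        P = Phi[:, active]
        Sigma = np.linalg.inv(np.diag(alpha) + beta * P.T @ P)      # (7.83)
        m = beta * Sigma @ P.T @ t                                  # (7.82)
        # Clipping guards against round-off pushing gamma outside (0, 1].
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-12, 1.0)   # (7.89)
        alpha = gamma / np.maximum(m ** 2, 1e-300)                  # (7.87)
        resid = t - P @ m
        beta = (N - gamma.sum()) / (resid @ resid)                  # (7.88)
        keep = alpha < alpha_max
        keep[np.argmin(alpha)] = True  # never prune every basis function
        active, alpha = active[keep], alpha[keep]
    # Posterior for the final hyperparameters and the surviving basis functions.
    P = Phi[:, active]
    Sigma = np.linalg.inv(np.diag(alpha) + beta * P.T @ P)
    m = beta * Sigma @ P.T @ t
    return active, alpha, beta, m, Sigma

# Synthetic data (assumed): noisy sinusoid, one Gaussian kernel per point plus a bias.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))
Phi = np.hstack([np.ones((30, 1)), K])

active, alpha, beta, m, Sigma = fit_rvm(Phi, t)
rms = np.sqrt(np.mean((t - Phi[:, active] @ m) ** 2))
print(len(active), "basis functions kept; RMS residual", rms)
```

Pruning at a finite cap stands in for the limit in which a precision goes to infinity; it also keeps the matrix being inverted small as training proceeds.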

These two approaches to finding the values of the hyperparameters that maximize the evidence are formally equivalent. Numerically, however, it is found that the direct optimization approach corresponding to (7.87) and (7.88) gives somewhat faster convergence (Tipping, 2001).

As a result of the optimization, we find that a proportion of the hyperparameters $\{\alpha_i\}$ are driven to large (in principle infinite) values, and so the weight parameters $w_i$ corresponding to these hyperparameters have posterior distributions with mean and variance both zero. Thus those parameters, and the corresponding basis functions $\phi_i(\mathbf{x})$, are removed from the model and play no role in making predictions for new inputs.
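The effect of a very large precision can be checked numerically on the evidence itself. The sketch below evaluates the log marginal likelihood (7.85) with $\mathbf{C}$ from (7.86) and compares a model in which one $\alpha_i$ is set to $10^{12}$ with the model in which that basis function has been deleted. The function name, the random design matrix, and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, beta):
    """ln p(t | X, alpha, beta) from (7.85), with C given by (7.86)."""
    N = t.shape[0]
    # Dividing column i of Phi by alpha_i gives Phi A^{-1}; then C = I / beta + Phi A^{-1} Phi^T.
    C = np.eye(N) / beta + (Phi / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

# Random design matrix and targets (assumed, for illustration only).
rng = np.random.default_rng(1)
Phi = rng.standard_normal((8, 4))
t = rng.standard_normal(8)
beta = 25.0
alpha = np.array([1.0, 2.0, 1e12, 0.5])   # third precision is effectively infinite

full = log_marginal_likelihood(Phi, t, alpha, beta)
pruned = log_marginal_likelihood(Phi[:, [0, 1, 3]], t, alpha[[0, 1, 3]], beta)
print(abs(full - pruned))   # negligible: the third basis function contributes almost nothing to C
```

The two values agree to within numerical noise because a basis function with precision $\alpha_i$ enters $\mathbf{C}$ only through a term scaled by $\alpha_i^{-1}$.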

In the case of models of the form (7.78), the inputs $\mathbf{x}_n$ corresponding to the remaining nonzero weights are called relevance vectors, because they are identified through the mechanism of automatic relevance determination, and are analogous to the support vectors of an SVM. It is worth emphasizing, however, that this mechanism for achieving sparsity in probabilistic models through automatic relevance determination is quite general and can be applied to any model expressed as an adaptive linear combination of basis functions.

Having found values $\boldsymbol{\alpha}^\star$ and $\beta^\star$ for the hyperparameters that maximize the marginal likelihood, we can evaluate the predictive distribution over $t$ for a new input $\mathbf{x}$.

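Using (7.76) and (7.81), this is given by

$$p(t|\mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^\star, \beta^\star) = \int p(t|\mathbf{x}, \mathbf{w}, \beta^\star)\, p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^\star, \beta^\star)\, \mathrm{d}\mathbf{w} = \mathcal{N}\left(t|\mathbf{m}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \sigma^2(\mathbf{x})\right). \tag{7.90}$$

Thus the predictive mean is given by (7.76) with $\mathbf{w}$ set equal to the posterior mean $\mathbf{m}$, and the variance of the predictive distribution is given by

$$\sigma^2(\mathbf{x}) = (\beta^\star)^{-1} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \boldsymbol{\Sigma} \boldsymbol{\phi}(\mathbf{x}) \tag{7.91}$$

where $\boldsymbol{\Sigma}$ is given by (7.83) in which $\boldsymbol{\alpha}$ and $\beta$ are set to their optimized values $\boldsymbol{\alpha}^\star$ and $\beta^\star$.

[Margin references: Exercise 9.23, Section 7.2.2, Exercise 7.14, Section 6.4.2]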

This is just the familiar result (3.59) obtained in the context of linear regression. Recall that for localized basis functions, the predictive variance for linear regression models becomes small in regions of input space where there are no basis functions. In the case of an RVM with the basis functions centred on data points, the model will therefore become increasingly certain of its predictions when extrapolating outside the domain of the data (Rasmussen and Quiñonero-Candela, 2005), which of course is undesirable.
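The shrinking predictive variance can be reproduced in a few lines. The sketch below evaluates the predictive mean $\mathbf{m}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})$ and the variance (7.91) for Gaussian basis functions centred on the training inputs, and compares the variance at the training points with the variance far outside the data. To isolate the effect, the bias term is omitted, and fixed values of the hyperparameters stand in for the optimized ones; the function names, kernel width, and data are likewise assumptions made for illustration.

```python
import numpy as np

def kernel_features(x, centres, width):
    """Gaussian basis functions centred on the training inputs (no bias term)."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))

def predict(x_new, centres, m, Sigma, beta, width):
    """Predictive mean m^T phi(x) and variance (7.91) at each new input."""
    phi = kernel_features(x_new, centres, width)
    mean = phi @ m
    var = 1.0 / beta + np.einsum("ni,ij,nj->n", phi, Sigma, phi)
    return mean, var

# Synthetic data (assumed): noisy sinusoid on [0, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=15)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(15)
width, beta = 0.1, 100.0
alpha = np.ones(15)   # stand-ins for the optimized hyperparameters

Phi = kernel_features(x_train, x_train, width)
Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)   # (7.83)
m = beta * Sigma @ Phi.T @ t_train                           # (7.82)

_, var_near = predict(x_train, x_train, m, Sigma, beta, width)
mean_far, var_far = predict(np.array([50.0]), x_train, m, Sigma, beta, width)
print(var_near.min(), var_far[0], 1.0 / beta)
```

Far from the data every basis function evaluates to essentially zero, so the second term of (7.91) vanishes and the variance falls to the noise level $1/\beta$, below its value at any training point.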

The predictive distribution in Gaussian process regression does not suffer from this problem.

Figure 7.9: Illustration of RVM regression using the same data set, and the same Gaussian kernel functions, as used in Figure 7.8 for the ν-SVM regression model. The mean of the predictive distribution for the RVM is shown by the red line, and the one standard-deviation predictive distribution is shown by the shaded region. Also, the data points are shown in green, and the relevance vectors are indicated by blue circles. Note that there are only 3 relevance vectors compared to 7 support vectors for the ν-SVM in Figure 7.8. (Axes: $x$ from 0 to 1, $t$ from −1 to 1.)

However, the computational cost of making predictions with a Gaussian process is typically much higher than with an RVM.

Figure 7.9 shows an example of the RVM applied to the sinusoidal regression data set. Here the noise precision parameter $\beta$ is also determined through evidence maximization. We see that the number of relevance vectors in the RVM is significantly smaller than the number of support vectors used by the SVM. For a wide range of regression and classification tasks, the RVM is found to give models that are typically an order of magnitude more compact than the corresponding support vector machine, resulting in a significant improvement in the speed of processing on test data. Remarkably, this greater sparsity is achieved with little or no reduction in generalization error compared with the corresponding SVM.

The principal disadvantage of the RVM compared to the SVM is that training involves optimizing a nonconvex function, and training times can be longer than for a comparable SVM.

For a model with $M$ basis functions, the RVM requires inversion of a matrix of size $M \times M$, which in general requires $O(M^3)$ computation. In the specific case of the SVM-like model (7.78), we have $M = N + 1$. As we have noted, there are techniques for training SVMs whose cost is roughly quadratic in $N$.
