However, we have assumed that the contribution to the predictive variance arising from the additive noise, governed by the parameter β, is a constant. For some problems, known as heteroscedastic, the noise variance itself will also depend on x. To model this, we can extend the Gaussian process framework by introducing a second Gaussian process to represent the dependence of β on the input x (Goldberg et al., 1998). Because β is a variance, and hence nonnegative, we use the Gaussian process to model ln β(x).
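As a brief illustration (not from the text), the generative model implied by this construction can be simulated by drawing one function from a Gaussian process for the mean and a second, independent function for ln β(x). The kernel, length-scale, jitter, and random seed below are arbitrary choices made for the sketch.

```python
import numpy as np

def rbf_gram(x, length_scale, variance=1.0):
    """Exponentiated-quadratic Gram matrix on a 1-D grid."""
    sq = (x[:, None] - x[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / length_scale ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)

# Shared covariance for the two independent Gaussian processes; the
# length-scale 0.2 and the 1e-6 jitter are arbitrary choices.
K = rbf_gram(x, length_scale=0.2) + 1e-6 * np.eye(len(x))
L = np.linalg.cholesky(K)

y = L @ rng.normal(size=len(x))          # latent mean function y(x)
log_beta = L @ rng.normal(size=len(x))   # ln beta(x): log of the noise precision

# Heteroscedastic targets: the noise variance 1/beta(x) depends on x.
t = y + rng.normal(size=len(x)) * np.sqrt(np.exp(-log_beta))
print(t[:5])
```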
[Figure 6.9: Samples from the ARD prior for Gaussian processes, in which the kernel function is given by (6.71). The left plot corresponds to η1 = η2 = 1, and the right plot corresponds to η1 = 1, η2 = 0.01.]

6.4.4 Automatic relevance determination

In the previous section, we saw how maximum likelihood could be used to determine a value for the correlation length-scale parameter in a Gaussian process. This technique can usefully be extended by incorporating a separate parameter for each input variable (Rasmussen and Williams, 2006).
The result, as we shall see, is that the optimization of these parameters by maximum likelihood allows the relative importance of different inputs to be inferred from the data. This represents an example in the Gaussian process context of automatic relevance determination, or ARD, which was originally formulated in the framework of neural networks (MacKay, 1994; Neal, 1996). The mechanism by which appropriate inputs are preferred is discussed in Section 7.2.2.

Consider a Gaussian process with a two-dimensional input space x = (x1, x2), having a kernel function of the form

$$ k(\mathbf{x}, \mathbf{x}') = \theta_0 \exp\left\{ -\frac{1}{2} \sum_{i=1}^{2} \eta_i \,(x_i - x_i')^2 \right\} \qquad (6.71) $$

Samples from the resulting prior over functions y(x) are shown for two different settings of the precision parameters ηi in Figure 6.9.
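The following minimal sketch (NumPy code assumed here, not part of the book) shows how prior samples of the kind plotted in Figure 6.9 could be drawn using the kernel (6.71); the grid resolution and random seed are illustrative choices.

```python
import numpy as np

def ard_kernel(X1, X2, theta0=1.0, eta=(1.0, 1.0)):
    """ARD kernel (6.71): theta0 * exp(-0.5 * sum_i eta_i * (x_i - x'_i)^2)."""
    eta = np.asarray(eta)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2 * eta).sum(axis=-1)
    return theta0 * np.exp(-0.5 * sq)

rng = np.random.default_rng(1)

# Grid over the two-dimensional input space; y[i, j] corresponds to (x1=g[i], x2=g[j]).
g = np.linspace(-1.0, 1.0, 25)
X1, X2 = np.meshgrid(g, g, indexing="ij")
X = np.stack([X1.ravel(), X2.ravel()], axis=-1)

for eta in [(1.0, 1.0), (1.0, 0.01)]:        # the two settings shown in Figure 6.9
    K = ard_kernel(X, X, eta=eta) + 1e-6 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    y = (L @ rng.normal(size=len(X))).reshape(len(g), len(g))
    # Average variability of the sampled surface along each input direction:
    # with eta2 = 0.01 the surface is nearly flat in the x2 direction.
    print(eta, "std along x1:", y.std(axis=0).mean(), "std along x2:", y.std(axis=1).mean())
```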
We see that, as a particular parameter ηi becomes small, the function becomes relatively insensitive to the corresponding input variable xi. By adapting these parameters to a data set using maximum likelihood, it becomes possible to detect input variables that have little effect on the predictive distribution, because the corresponding values of ηi will be small. This can be useful in practice because it allows such inputs to be discarded. ARD is illustrated in Figure 6.10 using a simple synthetic data set having three inputs x1, x2 and x3 (Nabney, 2002).
The target variable t is generated by sampling 100 values of x1 from a Gaussian, evaluating the function sin(2πx1), and then adding Gaussian noise. Values of x2 are given by copying the corresponding values of x1 and adding noise, and values of x3 are sampled from an independent Gaussian distribution. Thus x1 is a good predictor of t, x2 is a more noisy predictor of t, and x3 has only chance correlations with t. The marginal likelihood for a Gaussian process with ARD parameters η1, η2, η3 is optimized using the scaled conjugate gradients algorithm. We see from Figure 6.10 that η1 converges to a relatively large value, η2 converges to a much smaller value, and η3 becomes very small, indicating that x3 is irrelevant for predicting t.

[Figure 6.10: Illustration of automatic relevance determination in a Gaussian process for a synthetic problem having three inputs x1, x2, and x3, for which the curves show the corresponding values of the hyperparameters η1 (red), η2 (green), and η3 (blue) as a function of the number of iterations when optimizing the marginal likelihood. Details are given in the text. Note the logarithmic scale on the vertical axis.]
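A hedged sketch of this experiment is given below. It generates the synthetic data as described, but the noise magnitudes are assumptions, and it maximizes the marginal likelihood with L-BFGS from SciPy rather than the scaled conjugate gradients algorithm used in the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data as described in the text: x1 drives the target, x2 is a
# noisy copy of x1, and x3 is independent noise. The noise magnitudes
# (0.3 and 0.1) are assumptions for this sketch.
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 0.3 * rng.normal(size=N)
x3 = rng.normal(size=N)
X = np.column_stack([x1, x2, x3])
t = np.sin(2 * np.pi * x1) + 0.1 * rng.normal(size=N)

def ard_gram(X, eta, theta0=1.0):
    """Gram matrix of the ARD exponential-quadratic kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2 * eta).sum(axis=-1)
    return theta0 * np.exp(-0.5 * sq)

def neg_log_marginal(params):
    """Negative log marginal likelihood of GP regression with covariance
    C = K + (1/beta) I; eta and beta are optimized in log space."""
    eta, beta = np.exp(params[:3]), np.exp(params[3])
    C = ard_gram(X, eta) + (1.0 / beta + 1e-6) * np.eye(N)   # small jitter for stability
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    return 0.5 * t @ alpha + np.log(np.diag(L)).sum() + 0.5 * N * np.log(2 * np.pi)

# L-BFGS (with numerical gradients) stands in for scaled conjugate gradients here.
res = minimize(neg_log_marginal, x0=np.zeros(4), method="L-BFGS-B")
print("learned eta:", np.exp(res.x[:3]))
# Expected qualitative outcome, as in Figure 6.10: eta1 large, eta2 smaller,
# and eta3 driven towards zero, marking x3 as irrelevant.
```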
The ARD framework is easily incorporated into the exponential-quadratic kernel (6.63) to give the following form of kernel function, which has been found useful for applications of Gaussian processes to a range of regression problems

$$ k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0 \exp\left\{ -\frac{1}{2} \sum_{i=1}^{D} \eta_i \,(x_{ni} - x_{mi})^2 \right\} + \theta_2 + \theta_3 \sum_{i=1}^{D} x_{ni}\, x_{mi} \qquad (6.72) $$

where D is the dimensionality of the input space.

6.4.5 Gaussian processes for classification

In a probabilistic approach to classification, our goal is to model the posterior probabilities of the target variable for a new input vector, given a set of training data.
These probabilities must lie in the interval (0, 1), whereas a Gaussian process model makes predictions that lie on the entire real axis. However, we can easily adapt Gaussian processes to classification problems by transforming the output of the Gaussian process using an appropriate nonlinear activation function.

Consider first the two-class problem with a target variable t ∈ {0, 1}. If we define a Gaussian process over a function a(x) and then transform the function using a logistic sigmoid y = σ(a), given by (4.59), then we will obtain a non-Gaussian stochastic process over functions y(x) where y ∈ (0, 1).
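As a small sketch (assumed NumPy code, not from the text), this construction amounts to drawing a(x) from a Gaussian process prior and passing it through the logistic sigmoid; the kernel and its parameters are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 200)

# GP prior over a(x); the length-scale (0.15) and amplitude (10) are
# illustrative choices giving a sample loosely resembling Figure 6.11.
K = 10.0 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.15 ** 2)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
a = L @ rng.normal(size=len(x))

y = sigmoid(a)           # y(x) lies in (0, 1): a valid value for p(t = 1 | x)
print(y.min(), y.max())
```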
This is illustrated for the case of a one-dimensional input space in Figure 6.11, in which the probability distribution over the target variable t is then given by the Bernoulli distribution

$$ p(t|a) = \sigma(a)^{t}\,(1 - \sigma(a))^{1-t}. \qquad (6.73) $$

[Figure 6.11: The left plot shows a sample from a Gaussian process prior over functions a(x), and the right plot shows the result of transforming this sample using a logistic sigmoid function.]

As usual, we denote the training set inputs by x1, . . . , xN with corresponding observed target variables t = (t1, . . . , tN)^T.
We also consider a single test point xN+1 with target value tN+1. Our goal is to determine the predictive distribution p(tN+1|t), where we have left the conditioning on the input variables implicit. To do this we introduce a Gaussian process prior over the vector aN+1, which has components a(x1), . . . , a(xN+1). This in turn defines a non-Gaussian process over tN+1, and by conditioning on the training data tN we obtain the required predictive distribution. The Gaussian process prior for aN+1 takes the form

$$ p(\mathbf{a}_{N+1}) = \mathcal{N}(\mathbf{a}_{N+1}|\mathbf{0}, \mathbf{C}_{N+1}). \qquad (6.74) $$

Unlike the regression case, the covariance matrix no longer includes a noise term because we assume that all of the training data points are correctly labelled.
However, for numerical reasons it is convenient to introduce a noise-like term governed by a parameter ν that ensures that the covariance matrix is positive definite. Thus the covariance matrix CN+1 has elements given by

$$ C(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) + \nu\,\delta_{nm} \qquad (6.75) $$

where k(xn, xm) is any positive semidefinite kernel function of the kind considered in Section 6.2, and the value of ν is typically fixed in advance. We shall assume that the kernel function k(x, x′) is governed by a vector θ of parameters, and we shall later discuss how θ may be learned from the training data.
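For concreteness, here is a minimal sketch (not from the text) of building the covariance matrix of (6.75); the exponential-quadratic kernel and the value ν = 1e-6 are assumptions made for illustration.

```python
import numpy as np

def covariance_matrix(X, kernel, nu=1e-6):
    """Covariance matrix of (6.75): C_nm = k(x_n, x_m) + nu * delta_nm."""
    K = kernel(X[:, None, :], X[None, :, :])
    return K + nu * np.eye(len(X))

# An exponential-quadratic kernel is used here purely as an example.
kernel = lambda a, b: np.exp(-0.5 * ((a - b) ** 2).sum(axis=-1))

X = np.random.default_rng(3).normal(size=(50, 2))
C = covariance_matrix(X, kernel)
np.linalg.cholesky(C)    # succeeds: the nu * I term keeps C positive definite
```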
For two-class problems, it is sufficient to predict p(tN+1 = 1|tN) because the value of p(tN+1 = 0|tN) is then given by 1 − p(tN+1 = 1|tN). The required predictive distribution is given by

$$ p(t_{N+1} = 1|\mathbf{t}_N) = \int p(t_{N+1} = 1|a_{N+1})\,p(a_{N+1}|\mathbf{t}_N)\,\mathrm{d}a_{N+1} \qquad (6.76) $$

where p(tN+1 = 1|aN+1) = σ(aN+1).

This integral is analytically intractable, and so may be approximated using sampling methods (Neal, 1997). Alternatively, we can consider techniques based on an analytical approximation. In Section 4.5.2, we derived the approximate formula (4.153) for the convolution of a logistic sigmoid with a Gaussian distribution.
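A short numerical sketch of that approximation, σ(κ(σ²)μ) with κ(σ²) = (1 + πσ²/8)^(−1/2), checked against a Monte Carlo estimate; the test values of μ and σ² are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_gauss_conv(mu, var):
    """Approximation (4.153): int sigma(a) N(a | mu, var) da ~= sigma(kappa(var) * mu),
    with kappa(var) = (1 + pi * var / 8) ** (-1/2)."""
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return sigmoid(kappa * mu)

# Quick Monte Carlo check of the approximation.
rng = np.random.default_rng(4)
mu, var = 1.2, 2.0                       # illustrative posterior moments
samples = rng.normal(mu, np.sqrt(var), size=200_000)
print(sigmoid_gauss_conv(mu, var))       # analytic approximation
print(sigmoid(samples).mean())           # Monte Carlo estimate; the two agree closely
```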
We can use this result to evaluate the integral in (6.76) provided we have a Gaussian approximation to the posterior distribution p(aN+1|tN). The usual justification for a Gaussian approximation to a posterior distribution is that the true posterior will tend to a Gaussian as the number of data points increases as a consequence of the central limit theorem. In the case of Gaussian processes, the number of variables grows with the number of data points, and so this argument does not apply directly. However, if we consider increasing the number of data points falling in a fixed region of x space, then the corresponding uncertainty in the function a(x) will decrease, again leading asymptotically to a Gaussian (Williams and Barber, 1998).

Three different approaches to obtaining a Gaussian approximation have been considered.
One technique is based on variational inference (Gibbs and MacKay, 2000) and makes use of the local variational bound (10.144) on the logistic sigmoid. This allows the product of sigmoid functions to be approximated by a product of Gaussians, thereby allowing the marginalization over aN to be performed analytically. The approach also yields a lower bound on the likelihood function p(tN|θ). The variational framework for Gaussian process classification can also be extended to multiclass (K > 2) problems by using a Gaussian approximation to the softmax function (Gibbs, 1997).

A second approach uses expectation propagation (Opper and Winther, 2000b; Minka, 2001b; Seeger, 2003).
Because the true posterior distribution is unimodal, as we shall see shortly, the expectation propagation approach can give good results.

6.4.6 Laplace approximation

The third approach to Gaussian process classification is based on the Laplace approximation, which we now consider in detail. In order to evaluate the predictive distribution (6.76), we seek a Gaussian approximation to the posterior distribution over aN+1, which, using Bayes' theorem, is given by

$$ \begin{aligned} p(a_{N+1}|\mathbf{t}_N) &= \int p(a_{N+1}, \mathbf{a}_N|\mathbf{t}_N)\,\mathrm{d}\mathbf{a}_N \\ &= \frac{1}{p(\mathbf{t}_N)}\int p(a_{N+1}, \mathbf{a}_N)\,p(\mathbf{t}_N|a_{N+1}, \mathbf{a}_N)\,\mathrm{d}\mathbf{a}_N \\ &= \frac{1}{p(\mathbf{t}_N)}\int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N)\,p(\mathbf{t}_N|\mathbf{a}_N)\,\mathrm{d}\mathbf{a}_N \\ &= \int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N|\mathbf{t}_N)\,\mathrm{d}\mathbf{a}_N \end{aligned} \qquad (6.77) $$
where we have used p(tN|aN+1, aN) = p(tN|aN). The conditional distribution p(aN+1|aN) is obtained by invoking the results (6.66) and (6.67) for Gaussian process regression, to give

$$ p(a_{N+1}|\mathbf{a}_N) = \mathcal{N}\!\left(a_{N+1}\,\middle|\,\mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{a}_N,\; c - \mathbf{k}^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{k}\right). \qquad (6.78) $$

We can therefore evaluate the integral in (6.77) by finding a Laplace approximation for the posterior distribution p(aN|tN), and then using the standard result for the convolution of two Gaussian distributions.

The prior p(aN) is given by a zero-mean Gaussian process with covariance matrix CN, and the data term (assuming independence of the data points) is given by

$$ p(\mathbf{t}_N|\mathbf{a}_N) = \prod_{n=1}^{N} \sigma(a_n)^{t_n}\,(1 - \sigma(a_n))^{1-t_n} = \prod_{n=1}^{N} e^{a_n t_n}\,\sigma(-a_n). \qquad (6.79) $$

We then obtain the Laplace approximation by Taylor expanding the logarithm of p(aN|tN), which up to an additive normalization constant is given by the quantity

$$ \begin{aligned} \Psi(\mathbf{a}_N) &= \ln p(\mathbf{a}_N) + \ln p(\mathbf{t}_N|\mathbf{a}_N) \\ &= -\frac{1}{2}\mathbf{a}_N^{\mathrm{T}}\mathbf{C}_N^{-1}\mathbf{a}_N - \frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{C}_N| + \mathbf{t}_N^{\mathrm{T}}\mathbf{a}_N - \sum_{n=1}^{N}\ln(1 + e^{a_n}) + \text{const}. \end{aligned} \qquad (6.80) $$

First we need to find the mode of the posterior distribution, and this requires that we evaluate the gradient of Ψ(aN), which is given by

$$ \nabla\Psi(\mathbf{a}_N) = \mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N $$

where σN is a vector with elements σ(an).
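To make the mode-finding step concrete, the following sketch (assumed NumPy code, not from the text) runs a Newton iteration on Ψ(aN) using the gradient above. The Hessian it uses, −(W + CN^(−1)) with W = diag(σn(1 − σn)), is not derived in this excerpt but follows by differentiating the gradient once more; the toy kernel and labels are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(C, t, num_iters=20):
    """Newton iteration for the mode of Psi(a_N) = ln p(a_N) + ln p(t_N | a_N).

    Gradient: t_N - sigma_N - C_N^{-1} a_N   (as given in the text).
    Hessian:  -(W + C_N^{-1}),  W = diag(sigma_n (1 - sigma_n)),
    obtained by differentiating the gradient once more (not shown in this excerpt).
    """
    N = len(t)
    a = np.zeros(N)
    C_inv = np.linalg.inv(C)                  # acceptable for a small toy example
    for _ in range(num_iters):
        sigma = sigmoid(a)
        grad = t - sigma - C_inv @ a
        hess = -(np.diag(sigma * (1.0 - sigma)) + C_inv)
        a = a - np.linalg.solve(hess, grad)   # Newton step towards the mode
    return a

# Toy one-dimensional classification problem with an assumed kernel.
rng = np.random.default_rng(5)
x = np.sort(rng.normal(size=30))
t = (x > 0).astype(float)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
C = K + 1e-4 * np.eye(len(t))                 # noise-like term nu, as in (6.75)

a_mode = laplace_mode(C, t)
print(np.round(sigmoid(a_mode), 2))           # fitted class probabilities at the inputs
```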