Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 71



The blue ellipse around each data point shows one standard deviation contour for the corresponding kernel. These appear noncircular due to the different scales on the horizontal and vertical axes.

In fact, this model defines not only a conditional expectation but also a full conditional distribution given by

$$p(t|\mathbf{x}) = \frac{p(t, \mathbf{x})}{\int p(t, \mathbf{x})\,\mathrm{d}t} = \frac{\sum_n f(\mathbf{x} - \mathbf{x}_n, t - t_n)}{\sum_m \int f(\mathbf{x} - \mathbf{x}_m, t - t_m)\,\mathrm{d}t} \tag{6.48}$$

from which other expectations can be evaluated.

As an illustration we consider the case of a single input variable $x$ in which $f(x, t)$ is given by a zero-mean isotropic Gaussian over the variable $\mathbf{z} = (x, t)$ with variance $\sigma^2$ (Exercise 6.18).
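A minimal sketch of this illustration, under stated assumptions: because $f$ is an isotropic Gaussian, it factorizes into a Gaussian in $x$ times a Gaussian in $t$, so (6.48) reduces to a mixture of Gaussians in $t$ centred on the targets $t_n$, with mixing weights given by the normalised Gaussian in $x$. The function names, the toy data and the value of `sigma2` are hypothetical, chosen only for illustration, and the code uses only the Python standard library.

```python
import math

def gauss(u, mean, var):
    """Univariate Gaussian density N(u | mean, var)."""
    return math.exp(-0.5 * (u - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixing_weights(x, xs, sigma2):
    """Normalised weights N(x | x_n, sigma2) / sum_m N(x | x_m, sigma2)."""
    raw = [gauss(x, xn, sigma2) for xn in xs]
    total = sum(raw)  # may underflow to zero if x is very far from every x_n
    return [r / total for r in raw]

def conditional_density(t, x, xs, ts, sigma2):
    """p(t|x) from (6.48): a Gaussian mixture with components centred on t_n."""
    ks = mixing_weights(x, xs, sigma2)
    return sum(k * gauss(t, tn, sigma2) for k, tn in zip(ks, ts))

def conditional_mean(x, xs, ts, sigma2):
    """E[t|x]: the weighted average of the targets t_n."""
    ks = mixing_weights(x, xs, sigma2)
    return sum(k * tn for k, tn in zip(ks, ts))

# Toy data (an assumption): noise-free samples of sin(2*pi*x) on [0, 1].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ts = [math.sin(2.0 * math.pi * x) for x in xs]
sigma2 = 0.01
```

Calling `conditional_mean(0.25, xs, ts, sigma2)` gives a value somewhat below the nearest target, because the neighbouring points pull the weighted average towards their own targets.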

The corresponding conditional distribution (6.48) is given by a Gaussian mixture, and is shown, together with the conditional mean, for the sinusoidal synthetic data set in Figure 6.3.

An obvious extension of this model is to allow for more flexible forms of Gaussian components, for instance having different variance parameters for the input and target variables. More generally, we could model the joint distribution $p(t, \mathbf{x})$ using a Gaussian mixture model, trained using techniques discussed in Chapter 9 (Ghahramani and Jordan, 1994), and then find the corresponding conditional distribution $p(t|\mathbf{x})$. In this latter case we no longer have a representation in terms of kernel functions evaluated at the training set data points.

However, the number of components in the mixture model can be smaller than the number of training set points, resulting in a model that is faster to evaluate for test data points. We have thereby accepted an increased computational cost during the training phase in order to have a model that is faster at making predictions.

6.4. Gaussian Processes

In Section 6.1, we introduced kernels by applying the concept of duality to a nonprobabilistic model for regression. Here we extend the role of kernels to probabilistic discriminative models, leading to the framework of Gaussian processes. We shall thereby see how kernels arise naturally in a Bayesian setting.

In Chapter 3, we considered linear regression models of the form $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})$ in which $\mathbf{w}$ is a vector of parameters and $\boldsymbol{\phi}(\mathbf{x})$ is a vector of fixed nonlinear basis functions that depend on the input vector $\mathbf{x}$. We showed that a prior distribution over $\mathbf{w}$ induced a corresponding prior distribution over functions $y(\mathbf{x}, \mathbf{w})$. Given a training data set, we then evaluated the posterior distribution over $\mathbf{w}$ and thereby obtained the corresponding posterior distribution over regression functions, which in turn (with the addition of noise) implies a predictive distribution $p(t|\mathbf{x})$ for new input vectors $\mathbf{x}$.

In the Gaussian process viewpoint, we dispense with the parametric model and instead define a prior probability distribution over functions directly.

At first sight, it might seem difficult to work with a distribution over the uncountably infinite space of functions. However, as we shall see, for a finite training set we only need to consider the values of the function at the discrete set of input values $\mathbf{x}_n$ corresponding to the training set and test set data points, and so in practice we can work in a finite space.

Models equivalent to Gaussian processes have been widely studied in many different fields.

For instance, in the geostatistics literature Gaussian process regression is known as kriging (Cressie, 1993). Similarly, ARMA (autoregressive moving average) models, Kalman filters, and radial basis function networks can all be viewed as forms of Gaussian process models. Reviews of Gaussian processes from a machine learning perspective can be found in MacKay (1998), Williams (1999), and MacKay (2003), and a comparison of Gaussian process models with alternative approaches is given in Rasmussen (1996).

See also Rasmussen and Williams (2006) for a recent textbook on Gaussian processes.

6.4.1 Linear regression revisited

In order to motivate the Gaussian process viewpoint, let us return to the linear regression example and re-derive the predictive distribution by working in terms of distributions over functions $y(\mathbf{x}, \mathbf{w})$. This will provide a specific example of a Gaussian process.

Consider a model defined in terms of a linear combination of $M$ fixed basis functions given by the elements of the vector $\boldsymbol{\phi}(\mathbf{x})$ so that

$$y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{6.49}$$

where $\mathbf{x}$ is the input vector and $\mathbf{w}$ is the $M$-dimensional weight vector.

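Now consider a prior distribution over $\mathbf{w}$ given by an isotropic Gaussian of the form

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I}) \tag{6.50}$$

governed by the hyperparameter $\alpha$, which represents the precision (inverse variance) of the distribution. For any given value of $\mathbf{w}$, the definition (6.49) defines a particular function of $\mathbf{x}$. The probability distribution over $\mathbf{w}$ defined by (6.50) therefore induces a probability distribution over functions $y(\mathbf{x})$. In practice, we wish to evaluate this function at specific values of $\mathbf{x}$, for example at the training data points $\mathbf{x}_1, \ldots, \mathbf{x}_N$. We are therefore interested in the joint distribution of the function values $y(\mathbf{x}_1), \ldots, y(\mathbf{x}_N)$, which we denote by the vector $\mathbf{y}$ with elements $y_n = y(\mathbf{x}_n)$ for $n = 1, \ldots, N$. From (6.49), this vector is given by

$$\mathbf{y} = \boldsymbol{\Phi}\mathbf{w} \tag{6.51}$$

where $\boldsymbol{\Phi}$ is the design matrix with elements $\Phi_{nk} = \phi_k(\mathbf{x}_n)$. We can find the probability distribution of $\mathbf{y}$ as follows. First of all we note that $\mathbf{y}$ is a linear combination of Gaussian distributed variables given by the elements of $\mathbf{w}$ and hence is itself Gaussian (Exercise 2.31). We therefore need only to find its mean and covariance, which are given from (6.50) by

$$\mathbb{E}[\mathbf{y}] = \boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}] = \mathbf{0} \tag{6.52}$$

$$\mathrm{cov}[\mathbf{y}] = \mathbb{E}\left[\mathbf{y}\mathbf{y}^{\mathrm{T}}\right] = \boldsymbol{\Phi}\,\mathbb{E}\left[\mathbf{w}\mathbf{w}^{\mathrm{T}}\right]\boldsymbol{\Phi}^{\mathrm{T}} = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}} = \mathbf{K} \tag{6.53}$$

where $\mathbf{K}$ is the Gram matrix with elements

$$K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \frac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_m) \tag{6.54}$$

and $k(\mathbf{x}, \mathbf{x}')$ is the kernel function.

This model provides us with a particular example of a Gaussian process.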
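The result (6.53) can be checked numerically with a short sketch. The code below builds the design matrix for a small, hypothetical polynomial basis, forms the Gram matrix $\mathbf{K} = \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}$, and compares it with a Monte Carlo estimate of $\mathbb{E}[\mathbf{y}\mathbf{y}^{\mathrm{T}}]$ obtained by sampling $\mathbf{w}$ from (6.50). The basis, the inputs, the value of `alpha` and all function names are assumptions made for illustration only.

```python
import random

def design_matrix(xs, basis):
    """Phi[n][k] = phi_k(x_n), as in (6.51)."""
    return [[phi(x) for phi in basis] for x in xs]

def gram_matrix(Phi, alpha):
    """K = (1/alpha) * Phi Phi^T, as in (6.53) and (6.54)."""
    N = len(Phi)
    return [[sum(a * b for a, b in zip(Phi[n], Phi[m])) / alpha
             for m in range(N)] for n in range(N)]

def monte_carlo_cov(Phi, alpha, num_samples, rng):
    """Estimate E[y y^T] by sampling w ~ N(0, alpha^{-1} I) and setting y = Phi w."""
    N, M = len(Phi), len(Phi[0])
    acc = [[0.0] * N for _ in range(N)]
    std = alpha ** -0.5  # alpha is a precision, so the standard deviation is alpha^{-1/2}
    for _ in range(num_samples):
        w = [rng.gauss(0.0, std) for _ in range(M)]
        y = [sum(p * wk for p, wk in zip(row, w)) for row in Phi]
        for n in range(N):
            for m in range(N):
                acc[n][m] += y[n] * y[m]
    return [[a / num_samples for a in row] for row in acc]

# Hypothetical setup: three scalar inputs and the polynomial basis (1, x, x^2).
xs = [-1.0, 0.0, 1.0]
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
alpha = 2.0

Phi = design_matrix(xs, basis)
K = gram_matrix(Phi, alpha)
K_mc = monte_carlo_cov(Phi, alpha, 20000, random.Random(0))
```

Up to sampling error, the entries of `K_mc` should agree with those of `K`, which is what (6.53) asserts.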

In general, a Gaussian process is defined as a probability distribution over functions $y(\mathbf{x})$ such that the set of values of $y(\mathbf{x})$ evaluated at an arbitrary set of points $\mathbf{x}_1, \ldots, \mathbf{x}_N$ jointly have a Gaussian distribution. In cases where the input vector $\mathbf{x}$ is two dimensional, this may also be known as a Gaussian random field. More generally, a stochastic process $y(\mathbf{x})$ is specified by giving the joint probability distribution for any finite set of values $y(\mathbf{x}_1), \ldots, y(\mathbf{x}_N)$ in a consistent manner.

A key point about Gaussian stochastic processes is that the joint distribution over $N$ variables $y_1, \ldots, y_N$ is specified completely by the second-order statistics, namely the mean and the covariance. In most applications, we will not have any prior knowledge about the mean of $y(\mathbf{x})$ and so by symmetry we take it to be zero. This is equivalent to choosing the mean of the prior over weight values $p(\mathbf{w}|\alpha)$ to be zero in the basis function viewpoint. The specification of the Gaussian process is then completed by giving the covariance of $y(\mathbf{x})$ evaluated at any two values of $\mathbf{x}$, which is given by the kernel function

$$\mathbb{E}[y(\mathbf{x}_n)\,y(\mathbf{x}_m)] = k(\mathbf{x}_n, \mathbf{x}_m). \tag{6.55}$$

For the specific case of a Gaussian process defined by the linear regression model (6.49) with a weight prior (6.50), the kernel function is given by (6.54).

We can also define the kernel function directly, rather than indirectly through a choice of basis function.
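As a sketch of what specifying the kernel directly looks like in practice, the code below draws one sample from a zero-mean Gaussian process prior on a grid of scalar inputs, using the exponential kernel $k(x, x') = \exp(-\theta|x - x'|)$. Sampling through a Cholesky factor of the Gram matrix is a standard technique but is not described in the text; the grid, the value of `theta`, the small diagonal `jitter` added for numerical stability, and all function names are assumptions.

```python
import math
import random

def exponential_kernel(x1, x2, theta=1.0):
    """k(x, x') = exp(-theta * |x - x'|) for scalar inputs."""
    return math.exp(-theta * abs(x1 - x2))

def gram(xs, kernel, jitter=1e-10):
    """Gram matrix K[n][m] = k(x_n, x_m), with a small jitter on the diagonal."""
    n = len(xs)
    return [[kernel(xs[i], xs[j]) + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_gp_prior(xs, kernel, rng):
    """Draw y ~ N(0, K) as y = L z, where z is a vector of standard normals."""
    L = cholesky(gram(xs, kernel))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]

rng = random.Random(0)
xs = [i / 20.0 - 1.0 for i in range(41)]  # grid on [-1, 1]
ys = sample_gp_prior(xs, exponential_kernel, rng)
```

Only the kernel function enters the computation: no basis functions or weight vector appear anywhere, which is the sense in which the process is defined directly.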

Figure 6.4 shows samples of functions drawn from Gaussian processes for two different choices of kernel function. The first of these is a ‘Gaussian’ kernel of the form (6.23), and the second is the exponential kernel given by

$$k(x, x') = \exp\left(-\theta\,|x - x'|\right) \tag{6.56}$$

which corresponds to the Ornstein-Uhlenbeck process originally introduced by Uhlenbeck and Ornstein (1930) to describe Brownian motion.

Figure 6.4: Samples from Gaussian processes for a ‘Gaussian’ kernel (left) and an exponential kernel (right).

6.4.2 Gaussian processes for regression

In order to apply Gaussian process models to the problem of regression, we need to take account of the noise on the observed target values, which are given by

$$t_n = y_n + \epsilon_n \tag{6.57}$$

where $y_n = y(\mathbf{x}_n)$, and $\epsilon_n$ is a random noise variable whose value is chosen independently for each observation $n$.

Here we shall consider noise processes that have a Gaussian distribution, so that

$$p(t_n|y_n) = \mathcal{N}(t_n|y_n, \beta^{-1}) \tag{6.58}$$

where $\beta$ is a hyperparameter representing the precision of the noise. Because the noise is independent for each data point, the joint distribution of the target values $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$ conditioned on the values of $\mathbf{y} = (y_1, \ldots, y_N)^{\mathrm{T}}$ is given by an isotropic Gaussian of the form

$$p(\mathbf{t}|\mathbf{y}) = \mathcal{N}(\mathbf{t}|\mathbf{y}, \beta^{-1}\mathbf{I}_N) \tag{6.59}$$

where $\mathbf{I}_N$ denotes the $N \times N$ unit matrix. From the definition of a Gaussian process, the marginal distribution $p(\mathbf{y})$ is given by a Gaussian whose mean is zero and whose covariance is defined by a Gram matrix $\mathbf{K}$ so that

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{0}, \mathbf{K}). \tag{6.60}$$

The kernel function that determines $\mathbf{K}$ is typically chosen to express the property that, for points $\mathbf{x}_n$ and $\mathbf{x}_m$ that are similar, the corresponding values $y(\mathbf{x}_n)$ and $y(\mathbf{x}_m)$ will be more strongly correlated than for dissimilar points. Here the notion of similarity will depend on the application.

In order to find the marginal distribution $p(\mathbf{t})$, conditioned on the input values $\mathbf{x}_1, \ldots, \mathbf{x}_N$, we need to integrate over $\mathbf{y}$. This can be done by making use of the results from Section 2.3.3 for the linear-Gaussian model. Using (2.115), we see that the marginal distribution of $\mathbf{t}$ is given by

$$p(\mathbf{t}) = \int p(\mathbf{t}|\mathbf{y})\,p(\mathbf{y})\,\mathrm{d}\mathbf{y} = \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C}) \tag{6.61}$$

where the covariance matrix $\mathbf{C}$ has elements

$$C(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1}\delta_{nm}. \tag{6.62}$$

This result reflects the fact that the two Gaussian sources of randomness, namely that associated with $y(\mathbf{x})$ and that associated with $\epsilon$, are independent and so their covariances simply add.

One widely used kernel function for Gaussian process regression is given by the exponential of a quadratic form, with the addition of constant and linear terms to give

$$k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0 \exp\left\{-\frac{\theta_1}{2}\|\mathbf{x}_n - \mathbf{x}_m\|^2\right\} + \theta_2 + \theta_3\,\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m. \tag{6.63}$$

Note that the term involving $\theta_3$ corresponds to a parametric model that is a linear function of the input variables. Samples from this prior are plotted for various values of the parameters $\theta_0$, ...
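As a rough illustration of (6.62) and (6.63), the sketch below evaluates the kernel for vector inputs and assembles the covariance matrix $\mathbf{C}$ of the marginal distribution $p(\mathbf{t})$ by adding the noise variance $\beta^{-1}$ on the diagonal. The function names, the example inputs and the parameter values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def kernel_663(xn, xm, theta):
    """Kernel (6.63): theta0 * exp(-theta1/2 * ||xn - xm||^2) + theta2 + theta3 * xn^T xm."""
    theta0, theta1, theta2, theta3 = theta
    sq_dist = sum((a - b) ** 2 for a, b in zip(xn, xm))
    dot = sum(a * b for a, b in zip(xn, xm))
    return theta0 * math.exp(-0.5 * theta1 * sq_dist) + theta2 + theta3 * dot

def marginal_covariance(X, theta, beta):
    """C[n][m] = k(x_n, x_m) + delta_nm / beta, as in (6.62)."""
    N = len(X)
    return [[kernel_663(X[n], X[m], theta) + (1.0 / beta if n == m else 0.0)
             for m in range(N)] for n in range(N)]

# Hypothetical one-dimensional inputs, stored as tuples so vector inputs also work.
X = [(0.0,), (1.0,), (2.0,)]
theta = (1.0, 4.0, 0.5, 0.25)  # (theta0, theta1, theta2, theta3), assumed values
beta = 25.0                    # noise precision, i.e. noise variance 0.04

C = marginal_covariance(X, theta, beta)
```

The noise term appears only on the diagonal of `C`, since the noise is independent across observations, while the off-diagonal entries are determined by the kernel alone.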
