Samples from this prior are plotted for various settings of the parameters θ0, . . . , θ3 in Figure 6.5, and Figure 6.6 shows a set of points sampled from the joint distribution (6.60) along with the corresponding values defined by (6.61).

[Figure 6.5: Samples from a Gaussian process prior defined by the covariance function (6.63). The title above each plot denotes (θ0, θ1, θ2, θ3); the six panels correspond to the settings (1.00, 4.00, 0.00, 0.00), (9.00, 4.00, 0.00, 0.00), (1.00, 64.00, 0.00, 0.00), (1.00, 0.25, 0.00, 0.00), (1.00, 4.00, 10.00, 0.00), and (1.00, 4.00, 0.00, 5.00).]
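As a concrete illustration of this sampling process, the following NumPy sketch draws one function from a Gaussian process prior and then adds observation noise, in the spirit of Figures 6.5 and 6.6. It is a minimal sketch, assuming that (6.63) has the form θ0 exp(-θ1‖x_n − x_m‖²/2) + θ2 + θ3 x_n^T x_m; the parameter values and all variable names are illustrative choices rather than code from the text.

```python
import numpy as np

def kernel(xn, xm, theta0=1.0, theta1=4.0, theta2=0.0, theta3=0.0):
    # Assumed form for the covariance function (6.63): an exponential-of-quadratic
    # term plus constant and linear terms, governed by (theta0, theta1, theta2, theta3).
    return (theta0 * np.exp(-0.5 * theta1 * (xn - xm) ** 2)
            + theta2 + theta3 * xn * xm)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 101)                  # input locations x_n
K = kernel(x[:, None], x[None, :])               # Gram matrix with elements k(x_n, x_m)
K += 1e-10 * np.eye(len(x))                      # small jitter for numerical stability

# Draw one sample function y from the prior over functions (the blue curve of
# Figure 6.6), then add independent Gaussian noise of precision beta to obtain
# the observed targets t_n (the green points of Figure 6.6).
y = rng.multivariate_normal(np.zeros(len(x)), K)
beta = 25.0                                      # illustrative noise precision
t = y + rng.normal(scale=beta ** -0.5, size=len(x))
```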
So far, we have used the Gaussian process viewpoint to build a model of the joint distribution over sets of data points. Our goal in regression, however, is to make predictions of the target variables for new inputs, given a set of training data. Let us suppose that t_N = (t_1, . . . , t_N)^T, corresponding to input values x_1, . . . , x_N, comprise the observed training set, and our goal is to predict the target variable t_{N+1} for a new input vector x_{N+1}. This requires that we evaluate the predictive distribution p(t_{N+1}|t_N). Note that this distribution is conditioned also on the variables x_1, . . . , x_N and x_{N+1}. However, to keep the notation simple we will not show these conditioning variables explicitly.

To find the conditional distribution p(t_{N+1}|t), we begin by writing down the joint distribution p(t_{N+1}), where t_{N+1} denotes the vector (t_1, . . . , t_N, t_{N+1})^T. We then apply the results from Section 2.3.1 to obtain the required conditional distribution, as illustrated in Figure 6.7.
From (6.61), the joint distribution over t_1, . . . , t_{N+1} will be given by

p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1})        (6.64)

where C_{N+1} is an (N + 1) × (N + 1) covariance matrix with elements given by (6.62). Because this joint distribution is Gaussian, we can apply the results from Section 2.3.1 to find the conditional Gaussian distribution. To do this, we partition the covariance matrix as follows

C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}        (6.65)

where C_N is the N × N covariance matrix with elements given by (6.62) for n, m = 1, . . . , N, the vector k has elements k(x_n, x_{N+1}) for n = 1, . . . , N, and the scalar c = k(x_{N+1}, x_{N+1}) + β^{-1}. Using the results (2.81) and (2.82), we see that the conditional distribution p(t_{N+1}|t) is a Gaussian distribution with mean and covariance given by

m(x_{N+1}) = k^T C_N^{-1} t        (6.66)

σ^2(x_{N+1}) = c - k^T C_N^{-1} k.        (6.67)

These are the key results that define Gaussian process regression. Because the vector k is a function of the test point input value x_{N+1}, we see that the predictive distribution is a Gaussian whose mean and variance both depend on x_{N+1}. An example of Gaussian process regression is shown in Figure 6.8.
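Read as an algorithm, (6.66) and (6.67) amount to building C_N from the training inputs, forming the vector k and the scalar c for each test input, and performing a couple of linear solves. The NumPy sketch below is a minimal illustration of these two equations, not code from the text; it assumes one-dimensional inputs, a kernel function such as the hypothetical one above, and solves against C_N rather than forming C_N^{-1} explicitly.

```python
import numpy as np

def gp_predict(x_train, t_train, x_test, kernel, beta):
    """Predictive mean (6.66) and variance (6.67) of Gaussian process regression."""
    N = len(x_train)
    # C_N has elements k(x_n, x_m) + beta^{-1} * delta_{nm}, cf. (6.62).
    C_N = kernel(x_train[:, None], x_train[None, :]) + np.eye(N) / beta
    C_inv_t = np.linalg.solve(C_N, t_train)              # C_N^{-1} t, computed once
    means, variances = [], []
    for x_new in x_test:
        k = kernel(x_train, x_new)                       # k has elements k(x_n, x_{N+1})
        c = kernel(x_new, x_new) + 1.0 / beta            # c = k(x_{N+1}, x_{N+1}) + beta^{-1}
        means.append(k @ C_inv_t)                        # m(x_{N+1}) = k^T C_N^{-1} t
        variances.append(c - k @ np.linalg.solve(C_N, k))  # c - k^T C_N^{-1} k
    return np.array(means), np.array(variances)
```

Evaluated on a grid of test inputs, the returned variance grows as the test point moves away from the training data, which is the qualitative behaviour illustrated in Figure 6.8.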
The only restriction on the kernel function is that the covariance matrix given by (6.62) must be positive definite. If λ_i is an eigenvalue of K, then the corresponding eigenvalue of C will be λ_i + β^{-1}. It is therefore sufficient that the kernel matrix k(x_n, x_m) be positive semidefinite for any pair of points x_n and x_m, so that λ_i ≥ 0, because any eigenvalue λ_i that is zero will still give rise to a positive eigenvalue for C because β > 0. This is the same restriction on the kernel function discussed earlier, and so we can again exploit all of the techniques in Section 6.2 to construct suitable kernels.
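This eigenvalue argument is easy to check numerically. The toy sketch below is purely illustrative (the rank-deficient Gram matrix and the value of β are arbitrary choices): it confirms that adding β^{-1}I shifts every eigenvalue of K by β^{-1}, so a kernel matrix that is only positive semidefinite still yields a positive definite C.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
K = A @ A.T                      # rank-3 PSD Gram matrix: two eigenvalues are (numerically) zero
beta = 10.0
C = K + np.eye(5) / beta         # C = K + beta^{-1} I

lam_K = np.linalg.eigvalsh(K)    # eigenvalues lambda_i of K, in ascending order
lam_C = np.linalg.eigvalsh(C)    # eigenvalues of C

print(np.allclose(lam_C, lam_K + 1.0 / beta))   # True: every eigenvalue shifted by beta^{-1}
print(bool((lam_C > 0).all()))                  # True: C is positive definite
```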
[Figure 6.6: Illustration of the sampling of data points {t_n} from a Gaussian process. The blue curve shows a sample function from the Gaussian process prior over functions, and the red points show the values of y_n obtained by evaluating the function at a set of input values {x_n}. The corresponding values of {t_n}, shown in green, are obtained by adding independent Gaussian noise to each of the {y_n}.]

Note that the mean (6.66) of the predictive distribution can be written, as a function of x_{N+1}, in the form

m(x_{N+1}) = \sum_{n=1}^{N} a_n k(x_n, x_{N+1})        (6.68)

where a_n is the nth component of C_N^{-1} t. Thus, if the kernel function k(x_n, x_m) depends only on the distance ‖x_n − x_m‖, then we obtain an expansion in radial basis functions.
The results (6.66) and (6.67) define the predictive distribution for Gaussian process regression with an arbitrary kernel function k(x_n, x_m). In the particular case in which the kernel function k(x, x′) is defined in terms of a finite set of basis functions, we can derive the results obtained previously in Section 3.3.2 for linear regression starting from the Gaussian process viewpoint.

For such models, we can therefore obtain the predictive distribution either by taking a parameter space viewpoint and using the linear regression result or by taking a function space viewpoint and using the Gaussian process result.

[Figure 6.7: Illustration of the mechanism of Gaussian process regression for the case of one training point and one test point, in which the red ellipses show contours of the joint distribution p(t_1, t_2). Here t_1 is the training data point, and conditioning on the value of t_1, corresponding to the vertical blue line, we obtain p(t_2|t_1) shown as a function of t_2 by the green curve.]

The central computational operation in using Gaussian processes will involve the inversion of a matrix of size N × N, for which standard methods require O(N^3) computations. By contrast, in the basis function model we have to invert a matrix S_N of size M × M, which has O(M^3) computational complexity. Note that for both viewpoints, the matrix inversion must be performed once for the given training set. For each new test point, both methods require a vector-matrix multiply, which has cost O(N^2) in the Gaussian process case and O(M^2) for the linear basis function model. If the number M of basis functions is smaller than the number N of data points, it will be computationally more efficient to work in the basis function framework. However, an advantage of the Gaussian process viewpoint is that we can consider covariance functions that can only be expressed in terms of an infinite number of basis functions.
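The separation between the one-off O(N^3) cost and the O(N^2) cost per test point is easy to see in code. The sketch below factorizes C_N once with a Cholesky decomposition and reuses the factor for every prediction; the use of a Cholesky factorization, the toy data set, and all names here are illustrative choices, not something prescribed by the text.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Toy one-dimensional training set and a simple squared-exponential kernel,
# used only to show how the costs separate.
rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, size=200)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=200)
beta = 100.0

def kern(a, b):
    return np.exp(-8.0 * (a - b) ** 2)

C_N = kern(x_train[:, None], x_train[None, :]) + np.eye(len(x_train)) / beta

# One-off O(N^3) work for the given training set: factorize C_N once.
chol = cho_factor(C_N)
C_inv_t = cho_solve(chol, t_train)          # C_N^{-1} t, reused for every test point

def predict_one(x_new):
    # O(N^2) work per test point: build k, one cho_solve call, a few dot products.
    k = kern(x_train, x_new)
    c = kern(x_new, x_new) + 1.0 / beta
    return k @ C_inv_t, c - k @ cho_solve(chol, k)   # mean (6.66), variance (6.67)

print(predict_one(0.3))
```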
For large training data sets, however, the direct application of Gaussian process methods can become infeasible, and so a range of approximation schemes have been developed that have better scaling with training set size than the exact approach (Gibbs, 1997; Tresp, 2001; Smola and Bartlett, 2001; Williams and Seeger, 2001; Csató and Opper, 2002; Seeger et al., 2003). Practical issues in the application of Gaussian processes are discussed in Bishop and Nabney (2008).

We have introduced Gaussian process regression for the case of a single target variable. The extension of this formalism to multiple target variables, known as co-kriging (Cressie, 1993), is straightforward. Various other extensions of Gaussian process regression have also been considered, for purposes such as modelling the distribution over low-dimensional manifolds for unsupervised learning (Bishop et al., 1998a) and the solution of stochastic differential equations (Graepel, 2003).

[Figure 6.8: Illustration of Gaussian process regression applied to the sinusoidal data set in Figure A.6 in which the three right-most data points have been omitted. The green curve shows the sinusoidal function from which the data points, shown in blue, are obtained by sampling and addition of Gaussian noise. The red line shows the mean of the Gaussian process predictive distribution, and the shaded region corresponds to plus and minus two standard deviations. Notice how the uncertainty increases in the region to the right of the data points.]

6.4.3 Learning the hyperparameters
The predictions of a Gaussian process model will depend, in part, on the choice of covariance function. In practice, rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data. These parameters govern such things as the length scale of the correlations and the precision of the noise, and correspond to the hyperparameters in a standard parametric model.

Techniques for learning the hyperparameters are based on the evaluation of the likelihood function p(t|θ), where θ denotes the hyperparameters of the Gaussian process model. The simplest approach is to make a point estimate of θ by maximizing the log likelihood function. Because θ represents a set of hyperparameters for the regression problem, this can be viewed as analogous to the type 2 maximum likelihood procedure (Section 3.5) for linear regression models. Maximization of the log likelihood can be done using efficient gradient-based optimization algorithms such as conjugate gradients (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008).
The log likelihood function for a Gaussian process regression model is easily evaluated using the standard form for a multivariate Gaussian distribution, giving

\ln p(t|θ) = -\frac{1}{2} \ln |C_N| - \frac{1}{2} t^T C_N^{-1} t - \frac{N}{2} \ln(2π).        (6.69)
For nonlinear optimization, we also need the gradient of the log likelihood function with respect to the parameter vector θ. We shall assume that evaluation of the derivatives of C_N is straightforward, as would be the case for the covariance functions considered in this chapter. Making use of the result (C.21) for the derivative of C_N^{-1}, together with the result (C.22) for the derivative of \ln |C_N|, we obtain

\frac{\partial}{\partial θ_i} \ln p(t|θ) = -\frac{1}{2} \mathrm{Tr}\left( C_N^{-1} \frac{\partial C_N}{\partial θ_i} \right) + \frac{1}{2} t^T C_N^{-1} \frac{\partial C_N}{\partial θ_i} C_N^{-1} t.        (6.70)

Because \ln p(t|θ) will in general be a nonconvex function, it can have multiple maxima.
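As a worked illustration of (6.69) and (6.70), the sketch below evaluates the log likelihood and its gradient for a hypothetical two-parameter covariance function k(x_n, x_m) = θ0 exp(-θ1 (x_n − x_m)^2 / 2) plus the β^{-1} noise term. The kernel choice, its derivatives, and the function names are assumptions made for this example only; the resulting pair of values could be handed, with signs flipped, to any gradient-based optimizer such as a conjugate-gradient routine.

```python
import numpy as np

def log_likelihood_and_grad(theta, x, t, beta):
    """ln p(t|theta), eq. (6.69), and its gradient, eq. (6.70), for a hypothetical
    kernel k(xn, xm) = theta0 * exp(-0.5 * theta1 * (xn - xm)**2)."""
    theta0, theta1 = theta
    N = len(x)
    sq_dist = (x[:, None] - x[None, :]) ** 2
    expo = np.exp(-0.5 * theta1 * sq_dist)
    C_N = theta0 * expo + np.eye(N) / beta           # covariance matrix C_N, cf. (6.62)

    C_inv = np.linalg.inv(C_N)                       # explicit inverse, for clarity only
    C_inv_t = C_inv @ t
    _, logdet = np.linalg.slogdet(C_N)
    ll = -0.5 * logdet - 0.5 * t @ C_inv_t - 0.5 * N * np.log(2 * np.pi)   # (6.69)

    # Derivatives of C_N with respect to theta0 and theta1.
    dC = [expo, -0.5 * theta0 * sq_dist * expo]
    grad = np.array([-0.5 * np.trace(C_inv @ dCi) + 0.5 * C_inv_t @ dCi @ C_inv_t
                     for dCi in dC])                 # the two terms of (6.70)
    return ll, grad
```

In practice one would typically work with a Cholesky factorization of C_N rather than its explicit inverse, and, as noted above, the objective is in general nonconvex, so different initializations of θ may lead to different maxima.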
It is straightforward to introduce a prior over θ and to maximize the log posterior using gradient-based methods. In a fully Bayesian treatment, we need to evaluate marginals over θ weighted by the product of the prior p(θ) and the likelihood function p(t|θ). In general, however, exact marginalization will be intractable, and we must resort to approximations.

The Gaussian process regression model gives a predictive distribution whose mean and variance are functions of the input vector x.