Bishop C.M., Pattern Recognition and Machine Learning (2006)
By following an argument analogous to that of Section 5.5.5, show that the resulting regularizer reduces to the Tikhonov form (5.135).

5.28 (⋆) www Consider a neural network, such as the convolutional network discussed in Section 5.5.6, in which multiple weights are constrained to have the same value. Discuss how the standard backpropagation algorithm must be modified in order to ensure that such constraints are satisfied when evaluating the derivatives of an error function with respect to the adjustable parameters in the network.

5.29 (⋆) www Verify the result (5.141).

5.30 (⋆) Verify the result (5.142).

5.31 (⋆) Verify the result (5.143).

5.32 (⋆⋆) Show that the derivatives of the mixing coefficients {πk}, defined by (5.146), with respect to the auxiliary parameters {ηj} are given by

\frac{\partial \pi_k}{\partial \eta_j} = \delta_{jk} \pi_j - \pi_j \pi_k    (5.208)

Hence, by making use of the constraint \sum_k \pi_k = 1, derive the result (5.147).

5.33 (⋆) Write down a pair of equations that express the Cartesian coordinates (x1, x2) for the robot arm shown in Figure 5.18 in terms of the joint angles θ1 and θ2 and the lengths L1 and L2 of the links. Assume the origin of the coordinate system is given by the attachment point of the lower arm. These equations define the ‘forward kinematics’ of the robot arm.

5.34 (⋆) www Derive the result (5.155) for the derivative of the error function with respect to the network output activations controlling the mixing coefficients in the mixture density network.

5.35 (⋆) Derive the result (5.156) for the derivative of the error function with respect to the network output activations controlling the component means in the mixture density network.

5.36 (⋆) Derive the result (5.157) for the derivative of the error function with respect to the network output activations controlling the component variances in the mixture density network.

5.37 (⋆) Verify the results (5.158) and (5.160) for the conditional mean and variance of the mixture density network model.

5.38 (⋆) Using the general result (2.115), derive the predictive distribution (5.172) for the Laplace approximation to the Bayesian neural network model.
5.39 (⋆) www Make use of the Laplace approximation result (4.135) to show that the evidence function for the hyperparameters α and β in the Bayesian neural network model can be approximated by (5.175).

5.40 (⋆) www Outline the modifications needed to the framework for Bayesian neural networks, discussed in Section 5.7.3, to handle multiclass problems using networks having softmax output-unit activation functions.

5.41 (⋆⋆) By following analogous steps to those given in Section 5.7.1 for regression networks, derive the result (5.183) for the marginal likelihood in the case of a network having a cross-entropy error function and logistic-sigmoid output-unit activation function.

6. Kernel Methods

In Chapters 3 and 4, we considered linear parametric models for regression and classification in which the form of the mapping y(x, w) from input x to output y is governed by a vector w of adaptive parameters.
During the learning phase, a set of training data is used either to obtain a point estimate of the parameter vector or to determine a posterior distribution over this vector. The training data is then discarded, and predictions for new inputs are based purely on the learned parameter vector w. This approach is also used in nonlinear parametric models such as neural networks.

However, there is a class of pattern recognition techniques, in which the training data points, or a subset of them, are kept and used also during the prediction phase. For instance, the Parzen probability density model comprised a linear combination of ‘kernel’ functions each one centred on one of the training data points. Similarly, in Section 2.5.2 we introduced a simple technique for classification called nearest neighbours, which involved assigning to each new test vector the same label as the
closest example from the training set. These are examples of memory-based methods that involve storing the entire training set in order to make predictions for future data points. They typically require a metric to be defined that measures the similarity of any two vectors in input space, and are generally fast to ‘train’ but slow at making predictions for test data points.

Many linear parametric models can be re-cast into an equivalent ‘dual representation’ in which the predictions are also based on linear combinations of a kernel function evaluated at the training data points. As we shall see, for models which are based on a fixed nonlinear feature space mapping φ(x), the kernel function is given by the relation

k(x, x') = \phi(x)^T \phi(x').    (6.1)

From this definition, we see that the kernel is a symmetric function of its arguments so that k(x, x') = k(x', x).
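As a minimal numerical sketch of relation (6.1), consider a hand-picked feature mapping (the particular φ below is purely illustrative and not one used in the text); defining the kernel as the inner product of the mapped inputs makes its symmetry immediate.

```python
import numpy as np

# Illustrative sketch of (6.1); the feature mapping is a hypothetical choice.
def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, xp):
    # kernel defined as the inner product of the mapped inputs
    return phi(x) @ phi(xp)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# symmetry k(x, z) = k(z, x) follows immediately from the definition
assert np.isclose(k(x, z), k(z, x))
```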
The kernel concept was introduced into the field of pattern recognition by Aizerman et al. (1964) in the context of the method of potential functions, so-called because of an analogy with electrostatics. Although neglected for many years, it was re-introduced into machine learning in the context of large margin classifiers by Boser et al. (1992), giving rise to the technique of support vector machines. Since then, there has been considerable interest in this topic, both in terms of theory and applications.
One of the most significant developments has been the extension of kernels to handle symbolic objects, thereby greatly expanding the range of problems that can be addressed.

The simplest example of a kernel function is obtained by considering the identity mapping for the feature space in (6.1) so that φ(x) = x, in which case k(x, x') = x^T x'. We shall refer to this as the linear kernel.

The concept of a kernel formulated as an inner product in a feature space allows us to build interesting extensions of many well-known algorithms by making use of the kernel trick, also known as kernel substitution. The general idea is that, if we have an algorithm formulated in such a way that the input vector x enters only in the form of scalar products, then we can replace that scalar product with some other choice of kernel. For instance, the technique of kernel substitution can be applied to principal component analysis in order to develop a nonlinear variant of PCA (Schölkopf et al., 1998).
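A minimal sketch of this idea, using an arbitrary illustrative feature mapping: the squared Euclidean distance between two points in feature space expands as ‖φ(x) − φ(z)‖² = k(x, x) − 2k(x, z) + k(z, z), so a distance-based algorithm can be kernelized using kernel evaluations alone.

```python
import numpy as np

# Kernel substitution sketch: a quantity expressible through scalar products,
# here the squared distance in feature space,
#   ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z),
# is evaluated on the right-hand side using only kernel values.
# The feature mapping is an arbitrary illustrative choice.
def phi(x):
    return np.array([x[0], x[1], np.sin(x[0]), x[0] * x[1]])

def k(x, z):
    return phi(x) @ phi(z)

x = np.array([0.3, -1.2])
z = np.array([1.1, 0.4])

explicit = np.sum((phi(x) - phi(z)) ** 2)       # works in feature space directly
via_kernel = k(x, x) - 2.0 * k(x, z) + k(z, z)  # uses only kernel evaluations
assert np.isclose(explicit, via_kernel)
```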
Other examples of kernel substitution include nearest-neighbour classifiers and the kernel Fisher discriminant (Mika et al., 1999; Roth and Steinhage, 2000; Baudat and Anouar, 2000).

There are numerous forms of kernel functions in common use, and we shall encounter several examples in this chapter. Many have the property of being a function only of the difference between the arguments, so that k(x, x') = k(x − x'), which are known as stationary kernels because they are invariant to translations in input space. A further specialization involves homogeneous kernels, also known as radial basis functions, which depend only on the magnitude of the distance (typically Euclidean) between the arguments, so that k(x, x') = k(‖x − x'‖).

For recent textbooks on kernel methods, see Schölkopf and Smola (2002), Herbrich (2002), and Shawe-Taylor and Cristianini (2004).
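Returning to the stationary and radial basis function kernels just described, the short check below assumes a Gaussian kernel k(x, x') = exp(−‖x − x'‖²/2σ²), a standard radial basis function; the length-scale value and the random inputs are arbitrary illustrative choices.

```python
import numpy as np

# Gaussian kernel, a standard radial basis function kernel; the length-scale
# sigma = 1.0 is an arbitrary illustrative value.
def gauss_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x, xp, shift = rng.normal(size=(3, 4))

# stationary: translating both arguments by the same vector leaves k unchanged
assert np.isclose(gauss_kernel(x, xp), gauss_kernel(x + shift, xp + shift))

# radial: the value depends on the arguments only through ||x - x'||
r = np.linalg.norm(x - xp)
assert np.isclose(gauss_kernel(x, xp), np.exp(-r ** 2 / 2.0))
```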
6.1. Dual Representations

Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally. This concept will play an important role when we consider support vector machines in the next chapter. Here we consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error function given by

J(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ w^T \phi(x_n) - t_n \right\}^2 + \frac{\lambda}{2} w^T w    (6.2)

where λ ≥ 0. If we set the gradient of J(w) with respect to w equal to zero, we see that the solution for w takes the form of a linear combination of the vectors φ(xn), with coefficients that are functions of w, of the form

w = -\frac{1}{\lambda} \sum_{n=1}^{N} \left\{ w^T \phi(x_n) - t_n \right\} \phi(x_n) = \sum_{n=1}^{N} a_n \phi(x_n) = \Phi^T a    (6.3)

where Φ is the design matrix, whose nth row is given by φ(xn)^T.
Here the vector a = (a1, ..., aN)^T, and we have defined

a_n = -\frac{1}{\lambda} \left\{ w^T \phi(x_n) - t_n \right\}.    (6.4)

Instead of working with the parameter vector w, we can now reformulate the least-squares algorithm in terms of the parameter vector a, giving rise to a dual representation. If we substitute w = Φ^T a into J(w), we obtain

J(a) = \frac{1}{2} a^T \Phi\Phi^T \Phi\Phi^T a - a^T \Phi\Phi^T t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi\Phi^T a    (6.5)
where t = (t1, ..., tN)^T. We now define the Gram matrix K = ΦΦ^T, which is an N × N symmetric matrix with elements

K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)    (6.6)

where we have introduced the kernel function k(x, x') defined by (6.1). In terms of the Gram matrix, the sum-of-squares error function can be written as

J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a.    (6.7)

Setting the gradient of J(a) with respect to a to zero, we obtain the following solution

a = (K + \lambda I_N)^{-1} t.    (6.8)

If we substitute this back into the linear regression model, we obtain the following prediction for a new input x

y(x) = w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I_N)^{-1} t    (6.9)

where we have defined the vector k(x) with elements kn(x) = k(xn, x). Thus we see that the dual formulation allows the solution to the least-squares problem to be expressed entirely in terms of the kernel function k(x, x').
This is known as a dual formulation because, by noting that the solution for a can be expressed as a linear combination of the elements of φ(x), we recover the original formulation in terms of the parameter vector w. Note that the prediction at x is given by a linear combination of the target values from the training set. In fact, we have already obtained this result, using a slightly different notation, in Section 3.3.3.

In the dual formulation, we determine the parameter vector a by inverting an N × N matrix, whereas in the original parameter space formulation we had to invert an M × M matrix in order to determine w. Because N is typically much larger than M, the dual formulation does not seem to be particularly useful. However, the advantage of the dual formulation, as we shall see, is that it is expressed entirely in terms of the kernel function k(x, x'). We can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x), which allows us implicitly to use feature spaces of high, even infinite, dimensionality. The existence of a dual representation based on the Gram matrix is a property of many linear models, including the perceptron.
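As a small numerical sketch (with arbitrary synthetic data and an illustrative value of the regularization coefficient), the dual solution (6.8) and (6.9) can be checked against the primal solution of the regularized least-squares problem (6.2); the two give identical predictions.

```python
import numpy as np

# Numerical sketch comparing the dual solution (6.8)-(6.9) with the primal
# solution of regularized least squares; data and sizes are illustrative only.
rng = np.random.default_rng(1)
N, M, lam = 20, 5, 0.1

Phi = rng.normal(size=(N, M))          # design matrix; nth row is phi(x_n)^T
t = rng.normal(size=N)                 # target values
phi_new = rng.normal(size=M)           # feature vector phi(x) of a new input x

# primal: w = (lambda I_M + Phi^T Phi)^{-1} Phi^T t, prediction w^T phi(x)
w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
y_primal = w @ phi_new

# dual: a = (K + lambda I_N)^{-1} t with Gram matrix K = Phi Phi^T,
# prediction y(x) = k(x)^T a where k(x)_n = phi(x_n)^T phi(x)
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
y_dual = (Phi @ phi_new) @ a

assert np.isclose(y_primal, y_dual)    # both formulations give the same prediction
```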
In Section 6.4, we will develop a duality between probabilistic linear models for regression and the technique of Gaussian processes. Duality will also play an important role when we discuss support vector machines in Chapter 7.

6.2. Constructing Kernels

In order to exploit kernel substitution, we need to be able to construct valid kernel functions. One approach is to choose a feature space mapping φ(x) and then use this to find the corresponding kernel, as is illustrated in Figure 6.1. Here the kernel function is defined for a one-dimensional input space by

k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x) \phi_i(x')    (6.10)

where φi(x) are the basis functions.

An alternative approach is to construct kernel functions directly. In this case, we must ensure that the function we choose is a valid kernel, in other words that it corresponds to a scalar product in some (perhaps infinite dimensional) feature space. As a simple example, consider a kernel function given by

k(x, z) = \left( x^T z \right)^2.    (6.11)
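For two-dimensional inputs, (6.11) can indeed be written as an inner product of explicit feature vectors, since (x^T z)^2 = x1²z1² + 2x1x2z1z2 + x2²z2². The mapping used in the check below, φ(x) = (x1², √2 x1x2, x2²)^T, is one valid choice introduced here for illustration.

```python
import numpy as np

# One valid explicit feature mapping for the kernel k(x, z) = (x^T z)^2
# with two-dimensional inputs: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)^T.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(2)
x, z = rng.normal(size=(2, 2))

# the kernel value equals the feature-space inner product
assert np.isclose(k(x, z), phi(x) @ phi(z))
```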