Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 55


Text from the file (page 55)

This figure also shows how individual hidden units work collaboratively to approximate the final function. The role of hidden units in a simple classification problem is illustrated in Figure 5.4 using the synthetic classification data set described in Appendix A.

5.1.1 Weight-space symmetries

One property of feed-forward networks, which will play a role when we consider Bayesian model comparison, is that multiple distinct choices for the weight vector w can all give rise to the same mapping function from inputs to outputs (Chen et al., 1993). Consider a two-layer network of the form shown in Figure 5.1 with M hidden units having 'tanh' activation functions and full connectivity in both layers.

If we change the sign of all of the weights and the bias feeding into a particular hidden unit, then, for a given input pattern, the sign of the activation of the hidden unit will be reversed, because 'tanh' is an odd function, so that $\tanh(-a) = -\tanh(a)$. This transformation can be exactly compensated by changing the sign of all of the weights leading out of that hidden unit. Thus, by changing the signs of a particular group of weights (and a bias), the input–output mapping function represented by the network is unchanged, and so we have found two different weight vectors that give rise to the same mapping function.
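The sign-flip argument can be checked numerically. The following is a minimal sketch, assuming a two-layer network with linear outputs; the function name `two_layer_tanh`, the layer sizes, and the random weights are illustrative choices, not taken from the text.

```python
import numpy as np

def two_layer_tanh(x, W1, b1, W2, b2):
    """Two-layer network: tanh hidden units followed by linear outputs."""
    z = np.tanh(W1 @ x + b1)      # hidden-unit activations
    return W2 @ z + b2            # output-unit activations

# Illustrative sizes (assumed): D inputs, M hidden units, K outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)
x = rng.normal(size=D)

# Flip the sign of everything feeding into hidden unit j,
# and of every weight leading out of it.
j = 1
W1_f, b1_f, W2_f = W1.copy(), b1.copy(), W2.copy()
W1_f[j, :] *= -1
b1_f[j] *= -1
W2_f[:, j] *= -1

y_original = two_layer_tanh(x, W1, b1, W2, b2)
y_flipped = two_layer_tanh(x, W1_f, b1_f, W2_f, b2)
outputs_match = np.allclose(y_original, y_flipped)
print(outputs_match)
```

The two weight settings differ, yet the outputs agree up to floating-point rounding, because the flipped hidden activation is multiplied by flipped outgoing weights.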

For M hidden units, there will be M such 'sign-flip' symmetries, and thus any given weight vector will be one of a set of $2^M$ equivalent weight vectors.

Figure 5.4: Example of the solution of a simple two-class classification problem involving synthetic data using a neural network having two inputs, two hidden units with 'tanh' activation functions, and a single output having a logistic sigmoid activation function. The dashed blue lines show the z = 0.5 contours for each of the hidden units, and the red line shows the y = 0.5 decision surface for the network. For comparison, the green line denotes the optimal decision boundary computed from the distributions used to generate the data.

Similarly, imagine that we interchange the values of all of the weights (and the bias) leading both into and out of a particular hidden unit with the corresponding values of the weights (and bias) associated with a different hidden unit.

Again, this clearly leaves the network input–output mapping function unchanged, but it corresponds to a different choice of weight vector. For M hidden units, any given weight vector will belong to a set of $M!$ equivalent weight vectors associated with this interchange symmetry, corresponding to the $M!$ different orderings of the hidden units. The network will therefore have an overall weight-space symmetry factor of $M!\,2^M$. For networks with more than two layers of weights, the total level of symmetry will be given by the product of such factors, one for each layer of hidden units.

It turns out that these factors account for all of the symmetries in weight space (except for possible accidental symmetries due to specific choices for the weight values).
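The overall factor of M! 2^M can be illustrated by enumerating every combination of hidden-unit ordering and sign flip for a small network. This is a sketch under assumed sizes (M = 3, so 3! · 2³ = 48 copies); the helper `equivalent_weights` is a hypothetical name introduced here.

```python
import itertools
import math
import numpy as np

def two_layer_tanh(x, W1, b1, W2, b2):
    """Two-layer network: tanh hidden units followed by linear outputs."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def equivalent_weights(W1, b1, W2):
    """Yield every (W1, b1, W2) obtained by reordering and sign-flipping hidden units."""
    M = W1.shape[0]
    for perm in itertools.permutations(range(M)):
        p = list(perm)
        for signs in itertools.product([1.0, -1.0], repeat=M):
            s = np.array(signs)
            # Rows of W1 and entries of b1 feed into hidden units;
            # columns of W2 lead out of them.
            yield s[:, None] * W1[p], s * b1[p], W2[:, p] * s

# Illustrative sizes and random weights (assumed, not from the text).
rng = np.random.default_rng(0)
D, M, K = 2, 3, 1
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)
x = rng.normal(size=D)
y_ref = two_layer_tanh(x, W1, b1, W2, b2)

copies = list(equivalent_weights(W1, b1, W2))
all_match = all(
    np.allclose(two_layer_tanh(x, A, b, C, b2), y_ref) for A, b, C in copies
)
# Count how many of the generated weight settings are actually different.
distinct = {
    np.concatenate([A.ravel(), b.ravel(), C.ravel()]).tobytes()
    for A, b, C in copies
}
print(len(copies), len(distinct), all_match)
```

The output biases `b2` are left untouched, since neither symmetry involves them.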

Furthermore, the existence of these symmetries is not a particular property of the 'tanh' function but applies to a wide range of activation functions (Kůrková and Kainen, 1994). In many cases, these symmetries in weight space are of little practical consequence, although in Section 5.7 we shall encounter a situation in which we need to take them into account.

5.2. Network Training

So far, we have viewed neural networks as a general class of parametric nonlinear functions from a vector x of input variables to a vector y of output variables.

A simple approach to the problem of determining the network parameters is to make an analogy with the discussion of polynomial curve fitting in Section 1.1, and therefore to minimize a sum-of-squares error function. Given a training set comprising a set of input vectors $\{\mathbf{x}_n\}$, where $n = 1, \ldots, N$, together with a corresponding set of target vectors $\{\mathbf{t}_n\}$, we minimize the error function

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \|\mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n\|^2. \tag{5.11}$$

However, we can provide a much more general view of network training by first giving a probabilistic interpretation to the network outputs. We have already seen many advantages of using probabilistic predictions in Section 1.5.4. Here it will also provide us with a clearer motivation both for the choice of output unit nonlinearity and the choice of error function.

We start by discussing regression problems, and for the moment we consider a single target variable t that can take any real value.

Following the discussions in Sections 1.2.5 and 3.1, we assume that t has a Gaussian distribution with an x-dependent mean, which is given by the output of the neural network, so that

$$p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right) \tag{5.12}$$

where β is the precision (inverse variance) of the Gaussian noise. Of course this is a somewhat restrictive assumption, and in Section 5.6 we shall see how to extend this approach to allow for more general conditional distributions. For the conditional distribution given by (5.12), it is sufficient to take the output unit activation function to be the identity, because such a network can approximate any continuous function from x to y.

Given a data set of N independent, identically distributed observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, along with corresponding target values $\mathbf{t} = \{t_1, \ldots, t_N\}$, we can construct the corresponding likelihood function

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta).$$

Taking the negative logarithm, we obtain the error function

$$\frac{\beta}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi) \tag{5.13}$$

which can be used to learn the parameters w and β. In Section 5.7, we shall discuss the Bayesian treatment of neural networks, while here we consider a maximum likelihood approach. Note that in the neural networks literature, it is usual to consider the minimization of an error function rather than the maximization of the (log) likelihood, and so here we shall follow this convention.
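The step from the likelihood to (5.13) can be spelled out term by term. The following sketch uses only the Gaussian assumption (5.12); the shorthand $y_n = y(\mathbf{x}_n, \mathbf{w})$ is introduced here for brevity.

```latex
\begin{align*}
-\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
  &= -\sum_{n=1}^{N} \ln \mathcal{N}\!\left(t_n \mid y_n, \beta^{-1}\right) \\
  &= -\sum_{n=1}^{N} \left[ \tfrac{1}{2}\ln\beta - \tfrac{1}{2}\ln(2\pi)
       - \tfrac{\beta}{2}\,(y_n - t_n)^2 \right] \\
  &= \frac{\beta}{2} \sum_{n=1}^{N} (y_n - t_n)^2
       - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi).
\end{align*}
```

Only the first term depends on w, which is why the determination of w reduces to a sum-of-squares problem.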

Consider first the determination of w. Maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function given by

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 \tag{5.14}$$

where we have discarded additive and multiplicative constants. The value of w found by minimizing E(w) will be denoted $\mathbf{w}_{\mathrm{ML}}$ because it corresponds to the maximum likelihood solution. In practice, the nonlinearity of the network function $y(\mathbf{x}_n, \mathbf{w})$ causes the error E(w) to be nonconvex, and so local maxima of the likelihood may be found, corresponding to local minima of the error function, as discussed in Section 5.2.1.

Having found $\mathbf{w}_{\mathrm{ML}}$, the value of β can be found by minimizing the negative log likelihood to give

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - t_n\}^2. \tag{5.15}$$

Note that this can be evaluated once the iterative optimization required to find $\mathbf{w}_{\mathrm{ML}}$ is completed.
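The estimate (5.15) is simply the inverse of the mean squared residual, and it can be checked against the negative log likelihood (5.13) directly. In this sketch the array `y_pred` is a stand-in for the outputs of an already-trained network (an assumption; no network is trained here), and the data are synthetic.

```python
import numpy as np

def neg_log_likelihood(beta, y, t):
    """Negative log likelihood (5.13) for Gaussian noise with precision beta."""
    N = len(t)
    return (0.5 * beta * np.sum((y - t) ** 2)
            - 0.5 * N * np.log(beta)
            + 0.5 * N * np.log(2 * np.pi))

# Synthetic stand-ins (assumed): predictions y(x_n, w_ML) and noisy targets.
rng = np.random.default_rng(1)
N = 200
noise_precision = 4.0
y_pred = rng.normal(size=N)
t = y_pred + rng.normal(scale=noise_precision ** -0.5, size=N)

# Equation (5.15): 1 / beta_ML is the mean squared residual.
beta_ml = 1.0 / np.mean((y_pred - t) ** 2)

# beta_ML should give a lower value of (5.13) than nearby precisions.
nll_at_ml = neg_log_likelihood(beta_ml, y_pred, t)
nll_below = neg_log_likelihood(0.9 * beta_ml, y_pred, t)
nll_above = neg_log_likelihood(1.1 * beta_ml, y_pred, t)
print(beta_ml, nll_at_ml <= nll_below, nll_at_ml <= nll_above)
```

Because β enters (5.13) only through a linear term and a logarithm, the minimum over β is unique for fixed residuals.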

If we have multiple target variables, and we assume that they are independent conditional on x and w with shared noise precision β, then the conditional distribution of the target values is given by

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\left(\mathbf{t} \mid \mathbf{y}(\mathbf{x}, \mathbf{w}), \beta^{-1}\mathbf{I}\right). \tag{5.16}$$

Following the same argument as for a single target variable, we see that the maximum likelihood weights are determined by minimizing the sum-of-squares error function (5.11) (Exercise 5.2). The noise precision is then given by

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{NK} \sum_{n=1}^{N} \|\mathbf{y}(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - \mathbf{t}_n\|^2 \tag{5.17}$$

where K is the number of target variables. The assumption of independence can be dropped at the expense of a slightly more complex optimization problem (Exercise 5.3).

Recall from Section 4.3.6 that there is a natural pairing of the error function (given by the negative log likelihood) and the output unit activation function.

In the regression case, we can view the network as having an output activation function that is the identity, so that $y_k = a_k$. The corresponding sum-of-squares error function has the property

$$\frac{\partial E}{\partial a_k} = y_k - t_k \tag{5.18}$$

which we shall make use of when discussing error backpropagation in Section 5.3.

Now consider the case of binary classification in which we have a single target variable t such that t = 1 denotes class $\mathcal{C}_1$ and t = 0 denotes class $\mathcal{C}_2$. Following the discussion of canonical link functions in Section 4.3.6, we consider a network having a single output whose activation function is a logistic sigmoid

$$y = \sigma(a) \equiv \frac{1}{1 + \exp(-a)} \tag{5.19}$$

so that $0 \leqslant y(\mathbf{x}, \mathbf{w}) \leqslant 1$. We can interpret $y(\mathbf{x}, \mathbf{w})$ as the conditional probability $p(\mathcal{C}_1 \mid \mathbf{x})$, with $p(\mathcal{C}_2 \mid \mathbf{x})$ given by $1 - y(\mathbf{x}, \mathbf{w})$.

The conditional distribution of targets given inputs is then a Bernoulli distribution of the form

$$p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \{1 - y(\mathbf{x}, \mathbf{w})\}^{1-t}. \tag{5.20}$$

If we consider a training set of independent observations, then the error function, which is given by the negative log likelihood, is a cross-entropy error function of the form

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\} \tag{5.21}$$

where $y_n$ denotes $y(\mathbf{x}_n, \mathbf{w})$. Note that there is no analogue of the noise precision β because the target values are assumed to be correctly labelled. However, the model is easily extended to allow for labelling errors (Exercise 5.4).
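The cross-entropy error (5.21) with a logistic sigmoid output (5.19) has the same simple derivative with respect to the output activation as (5.18), namely y − t. A finite-difference check makes this concrete; the activations `a` and targets `t` below are arbitrary illustrative values, and the error is written as a function of the activations rather than of the weights.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (5.19)."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(a, t):
    """Cross-entropy error (5.21), with y_n = sigmoid(a_n)."""
    y = sigmoid(a)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Arbitrary illustrative output activations and binary targets.
a = np.array([-2.0, -0.5, 0.3, 1.7])
t = np.array([0.0, 1.0, 1.0, 0.0])

# Analytic derivative dE/da_n = y_n - t_n.
analytic = sigmoid(a) - t

# Central finite differences, perturbing one activation at a time.
eps = 1e-6
numeric = np.array([
    (cross_entropy(a + eps * e, t) - cross_entropy(a - eps * e, t)) / (2 * eps)
    for e in np.eye(len(a))
])
gradients_agree = np.allclose(analytic, numeric, atol=1e-6)
print(gradients_agree)
```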

Simard et al. (2003) found that using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization.

If we have K separate binary classifications to perform, then we can use a network having K outputs each of which has a logistic sigmoid activation function. Associated with each output is a binary class label $t_k \in \{0, 1\}$, where $k = 1, \ldots, K$. If we assume that the class labels are independent, given the input vector, then the conditional distribution of the targets is

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k} \left[1 - y_k(\mathbf{x}, \mathbf{w})\right]^{1 - t_k}. \tag{5.22}$$

Taking the negative logarithm of the corresponding likelihood function then gives the following error function (Exercise 5.5)

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk})\} \tag{5.23}$$

where $y_{nk}$ denotes $y_k(\mathbf{x}_n, \mathbf{w})$. Again, the derivative of the error function with respect to the activation for a particular output unit takes the form (5.18) just as in the regression case (Exercise 5.6).

It is interesting to contrast the neural network solution to this problem with the corresponding approach based on a linear classification model of the kind discussed in Chapter 4.

Suppose that we are using a standard two-layer network of the kind shown in Figure 5.1. We see that the weight parameters in the first layer of the network are shared between the various outputs, whereas in the linear model each classification problem is solved independently. The first layer of the network can be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.

Finally, we consider the standard multiclass classification problem in which each input is assigned to one of K mutually exclusive classes. The binary target variables $t_k \in \{0, 1\}$ have a 1-of-K coding scheme indicating the class, and the network outputs are interpreted as $y_k(\mathbf{x}, \mathbf{w}) = p(t_k = 1 \mid \mathbf{x})$, leading to the following error function

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w}). \tag{5.24}$$
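With 1-of-K targets, the double sum in (5.24) picks out one term per data point: the log of the output assigned to the correct class. A small sketch, assuming the network outputs are already available as a matrix `Y` of class probabilities (hand-picked values here, not produced by a network):

```python
import numpy as np

def multiclass_cross_entropy(Y, T):
    """Error (5.24): Y[n, k] = y_k(x_n, w), T[n, k] = 1-of-K target for point n."""
    return -np.sum(T * np.log(Y))

# Hand-picked output probabilities for N = 3 points and K = 3 classes (assumed).
Y = np.array([[0.70, 0.20, 0.10],
              [0.10, 0.80, 0.10],
              [0.25, 0.25, 0.50]])
labels = np.array([0, 1, 2])          # correct class index for each point
T = np.eye(3)[labels]                 # 1-of-K coding of the labels

E = multiclass_cross_entropy(Y, T)

# Equivalent form: sum only the log-probabilities of the correct classes.
E_picked = -np.sum(np.log(Y[np.arange(len(labels)), labels]))
print(E, E_picked)
```

Here the arrays are indexed point-first (`[n, k]`), which is a layout choice of this sketch; the sum in (5.24) is unaffected by the index order.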

