Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 61


5.5. Regularization in Neural Networks

will remain unchanged under the weight transformations provided the regularization parameters are re-scaled using $\lambda_1 \to a^{1/2}\lambda_1$ and $\lambda_2 \to c^{-1/2}\lambda_2$.

The regularizer (5.121) corresponds to a prior of the form

$$p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2} \sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in \mathcal{W}_2} w^2 \right). \qquad (5.122)$$

Note that priors of this form are improper (they cannot be normalized) because the bias parameters are unconstrained.
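The link between the two-group regularizer and the prior (5.122) can be sketched numerically: the unnormalized log prior is simply the negative of the regularizer. The following is a minimal sketch, assuming the weights of each group are held in NumPy arrays with the biases left out; the function names and the example values are illustrative and do not come from the text.

```python
import numpy as np

def grouped_regularizer(groups, alphas):
    """Sum over groups of (alpha_k / 2) times the sum of squared weights in group k."""
    return sum(0.5 * a * float(np.sum(np.square(w))) for w, a in zip(groups, alphas))

def log_prior_unnormalized(groups, alphas):
    """Log of the prior up to an additive constant: minus the regularizer."""
    return -grouped_regularizer(groups, alphas)

# Hypothetical weights: W1 holds first-layer weights, W2 second-layer weights.
first_layer = np.array([1.0, 2.0])
second_layer = np.array([3.0])
value = log_prior_unnormalized([first_layer, second_layer], [2.0, 4.0])
print(value)
```

Because the biases never enter the sum, shifting them leaves `value` unchanged, which is the sense in which the prior places no constraint on them.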

The use of improper priors can lead to difficulties in selecting regularization coefficients and in model comparison within the Bayesian framework, because the corresponding evidence is zero. It is therefore common to include separate priors for the biases (which then break shift invariance) having their own hyperparameters. We can illustrate the effect of the resulting four hyperparameters by drawing samples from the prior and plotting the corresponding network functions, as shown in Figure 5.11.

More generally, we can consider priors in which the weights are divided into any number of groups $\mathcal{W}_k$ so that

$$p(\mathbf{w}) \propto \exp\left( -\frac{1}{2} \sum_k \alpha_k \|\mathbf{w}\|_k^2 \right) \qquad (5.123)$$

where

$$\|\mathbf{w}\|_k^2 = \sum_{j \in \mathcal{W}_k} w_j^2. \qquad (5.124)$$

As a special case of this prior, if we choose the groups to correspond to the sets of weights associated with each of the input units, and we optimize the marginal likelihood with respect to the corresponding parameters $\alpha_k$, we obtain automatic relevance determination as discussed in Section 7.2.2.

5.5.2 Early stopping

An alternative to regularization as a way of controlling the effective complexity of a network is the procedure of early stopping.

The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data. For many of the optimization algorithms used for network training, such as conjugate gradients, the error is a nonincreasing function of the iteration index. However, the error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set, as indicated in Figure 5.12, in order to obtain a network having good generalization performance.

The behaviour of the network in this case is sometimes explained qualitatively in terms of the effective number of degrees of freedom in the network, in which this number starts out small and then grows during the training process, corresponding to a steady increase in the effective complexity of the model.
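The early stopping procedure described above can be sketched as follows. This is only an illustration under stated assumptions: plain gradient descent on a linear-in-parameters polynomial model stands in for conjugate-gradient training of a network, and the noisy sinusoidal data, the learning rate, and all function names are hypothetical choices rather than anything specified in the text.

```python
import numpy as np

def train_with_early_stopping(X_train, t_train, X_val, t_val, lr=0.1, max_steps=2000):
    """Gradient descent on a sum-of-squares error, tracking the validation minimum."""
    w = np.zeros(X_train.shape[1])
    best_w, best_err, best_step = w.copy(), np.inf, 0
    val_history = []
    for step in range(1, max_steps + 1):
        # Gradient of the mean squared training error (up to the factor 1/2).
        grad = X_train.T @ (X_train @ w - t_train) / len(t_train)
        w = w - lr * grad
        val_err = 0.5 * np.mean((X_val @ w - t_val) ** 2)
        val_history.append(val_err)
        if val_err < best_err:
            # Remember the weights at the smallest validation error seen so far.
            best_w, best_err, best_step = w.copy(), val_err, step
    return best_w, best_err, best_step, val_history

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy sinusoidal data with degree-9 polynomial features."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return np.vander(x, 10, increasing=True), t

X_train, t_train = make_data(15)
X_val, t_val = make_data(15)
best_w, best_err, best_step, val_history = train_with_early_stopping(
    X_train, t_train, X_val, t_val
)
print(best_step, best_err)
```

The sketch runs for a fixed budget of steps and returns the weights recorded at the validation minimum, which gives the same result as halting there; a practical implementation would typically stop once the validation error has failed to improve for some number of steps.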

Halting training before a minimum of the training error has been reached then represents a way of limiting the effective network complexity.

Figure 5.11 Illustration of the effect of the hyperparameters governing the prior distribution over weights and biases in a two-layer network having a single input, a single linear output, and 12 hidden units having 'tanh' activation functions. The priors are governed by four hyperparameters $\alpha_1^{b}$, $\alpha_1^{w}$, $\alpha_2^{b}$, and $\alpha_2^{w}$, which represent the precisions of the Gaussian distributions of the first-layer biases, first-layer weights, second-layer biases, and second-layer weights, respectively. We see that the parameter $\alpha_2^{w}$ governs the vertical scale of functions (note the different vertical axis ranges on the top two diagrams), $\alpha_1^{w}$ governs the horizontal scale of variations in the function values, and $\alpha_1^{b}$ governs the horizontal range over which variations occur. The parameter $\alpha_2^{b}$, whose effect is not illustrated here, governs the range of vertical offsets of the functions. (Panel titles: $\alpha_1^{w}=1, \alpha_1^{b}=1, \alpha_2^{w}=1, \alpha_2^{b}=1$; $\alpha_1^{w}=1, \alpha_1^{b}=1, \alpha_2^{w}=10, \alpha_2^{b}=1$; $\alpha_1^{w}=1000, \alpha_1^{b}=100, \alpha_2^{w}=1, \alpha_2^{b}=1$; $\alpha_1^{w}=1000, \alpha_1^{b}=1000, \alpha_2^{w}=1, \alpha_2^{b}=1$.)

In the case of a quadratic error function, we can verify this insight (Exercise 5.25), and show that early stopping should exhibit similar behaviour to regularization using a simple weight-decay term. This can be understood from Figure 5.13, in which the axes in weight space have been rotated to be parallel to the eigenvectors of the Hessian matrix.

If, in the absence of weight decay, the weight vector starts at the origin and proceeds during training along a path that follows the local negative gradient vector, then the weight vector will move initially parallel to the $w_2$ axis through a point corresponding roughly to $\widetilde{\mathbf{w}}$ and then move towards the minimum of the error function $\mathbf{w}_{\mathrm{ML}}$. This follows from the shape of the error surface and the widely differing eigenvalues of the Hessian. Stopping at a point near $\widetilde{\mathbf{w}}$ is therefore similar to weight decay. The relationship between early stopping and weight decay can be made quantitative, thereby showing that the quantity $\tau\eta$ (where $\tau$ is the iteration index, and $\eta$ is the learning rate parameter) plays the role of the reciprocal of the regularization parameter $\lambda$.

Figure 5.12 An illustration of the behaviour of training set error (left) and validation set error (right) during a typical training session, as a function of the iteration step, for the sinusoidal data set. The goal of achieving the best generalization performance suggests that training should be stopped at the point shown by the vertical dashed lines, corresponding to the minimum of the validation set error.
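The correspondence between $\tau\eta$ and the reciprocal of the regularization parameter can be checked numerically for a quadratic error. The sketch below assumes the axes are already aligned with the Hessian eigenvectors, so that the error is $\frac{1}{2}\sum_j h_j (w_j - w_j^{\mathrm{ML}})^2$ with eigenvalues $h_j$, and that weight decay adds $\frac{\lambda}{2}\|\mathbf{w}\|^2$. Under these assumptions, gradient descent from the origin shrinks each component of $\mathbf{w}_{\mathrm{ML}}$ by $1 - (1 - \eta h_j)^{\tau}$, while weight decay shrinks it by $h_j / (h_j + \lambda)$. The eigenvalues, learning rate, and function names are illustrative choices.

```python
import numpy as np

def gradient_descent_quadratic(eigvals, w_ml, lr, steps):
    """Run gradient descent from the origin on the diagonal quadratic error."""
    w = np.zeros_like(w_ml)
    for _ in range(steps):
        w = w - lr * eigvals * (w - w_ml)
    return w

def early_stopping_factor(eigvals, lr, steps):
    """Closed-form shrinkage of each component after `steps` iterations."""
    return 1.0 - (1.0 - lr * eigvals) ** steps

def weight_decay_factor(eigvals, reg):
    """Shrinkage of each component at the minimum of the regularized error."""
    return eigvals / (eigvals + reg)

# Two widely differing Hessian eigenvalues (hypothetical values).
eigvals = np.array([0.01, 1000.0])
w_ml = np.array([1.0, 1.0])
lr, steps = 0.001, 100
reg = 1.0 / (steps * lr)  # regularization parameter set to 1 / (tau * eta)

w_early = gradient_descent_quadratic(eigvals, w_ml, lr, steps)
es = early_stopping_factor(eigvals, lr, steps)
wd = weight_decay_factor(eigvals, reg)
print(es, wd)
```

In both schemes, directions with eigenvalues much larger than the regularization parameter end up close to their maximum likelihood values, while directions with much smaller eigenvalues stay close to zero.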

The effective number of parameters in the network therefore grows during the course of training.

5.5.3 Invariances

In many applications of pattern recognition, it is known that predictions should be unchanged, or invariant, under one or more transformations of the input variables. For example, in the classification of objects in two-dimensional images, such as handwritten digits, a particular object should be assigned the same classification irrespective of its position within the image (translation invariance) or of its size (scale invariance).

Such transformations produce significant changes in the raw data, expressed in terms of the intensities at each of the pixels in the image, and yet should give rise to the same output from the classification system. Similarly in speech recognition, small levels of nonlinear warping along the time axis, which preserve temporal ordering, should not change the interpretation of the signal.

If sufficiently large numbers of training patterns are available, then an adaptive model such as a neural network can learn the invariance, at least approximately.

This involves including within the training set a sufficiently large number of examples of the effects of the various transformations. Thus, for translation invariance in an image, the training set should include examples of objects at many different positions.

This approach may be impractical, however, if the number of training examples is limited, or if there are several invariants (because the number of combinations of transformations grows exponentially with the number of such transformations).

We therefore seek alternative approaches for encouraging an adaptive model to exhibit the required invariances. These can broadly be divided into four categories:

1. The training set is augmented using replicas of the training patterns, transformed according to the desired invariances. For instance, in our digit recognition example, we could make multiple copies of each example in which the digit is shifted to a different position in each image.

2. A regularization term is added to the error function that penalizes changes in the model output when the input is transformed. This leads to the technique of tangent propagation, discussed in Section 5.5.4.

3. Invariance is built into the pre-processing by extracting features that are invariant under the required transformations. Any subsequent regression or classification system that uses such features as inputs will necessarily also respect these invariances.

4. The final option is to build the invariance properties into the structure of a neural network (or into the definition of a kernel function in the case of techniques such as the relevance vector machine). One way to achieve this is through the use of local receptive fields and shared weights, as discussed in the context of convolutional neural networks in Section 5.5.6.

Figure 5.13 A schematic illustration of why early stopping can give similar results to weight decay in the case of a quadratic error function. The ellipse shows a contour of constant error, and $\mathbf{w}_{\mathrm{ML}}$ denotes the minimum of the error function. If the weight vector starts at the origin and moves according to the local negative gradient direction, then it will follow the path shown by the curve. By stopping training early, a weight vector $\widetilde{\mathbf{w}}$ is found that is qualitatively similar to that obtained with a simple weight-decay regularizer and training to the minimum of the regularized error, as can be seen by comparing with Figure 3.15. (Axes: $w_1$, $w_2$.)

Approach 1 is often relatively easy to implement and can be used to encourage complex invariances such as those illustrated in Figure 5.14.

For sequential training algorithms, this can be done by transforming each input pattern before it is presented to the model so that, if the patterns are being recycled, a different transformation (drawn from an appropriate distribution) is applied each time. For batch methods, a similar effect can be achieved by replicating each data point a number of times and transforming each copy independently. The use of such augmented data can lead to significant improvements in generalization (Simard et al., 2003), although it can also be computationally costly.

Approach 2 leaves the data set unchanged but modifies the error function through the addition of a regularizer.
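The two augmentation schemes described for Approach 1 can be sketched as follows. This is a minimal illustration restricted to translations, assuming images are 2-D NumPy arrays and using a circular shift (`np.roll`) as a stand-in for a true translation, so pixels that leave one edge re-enter at the opposite edge. The function names, the shift range, and the random images are all hypothetical.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift an image by a random offset drawn uniformly from [-max_shift, max_shift]."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))

def augment_batch(images, copies, max_shift, rng):
    """Batch scheme: replicate each image and transform each copy independently."""
    shifted = [random_shift(img, max_shift, rng) for img in images for _ in range(copies)]
    return np.stack(shifted)

def sequential_stream(images, epochs, max_shift, rng):
    """Sequential scheme: a fresh transformation each time a pattern is recycled."""
    for _ in range(epochs):
        for img in images:
            yield random_shift(img, max_shift, rng)

rng = np.random.default_rng(0)
images = rng.random((3, 8, 8))  # three hypothetical 8x8 images
augmented = augment_batch(images, copies=2, max_shift=2, rng=rng)
stream = list(sequential_stream(images, epochs=4, max_shift=2, rng=rng))
print(augmented.shape, len(stream))
```

The batch version multiplies the size of the data set by `copies`, which is where the computational cost mentioned above comes from; the sequential version keeps the data set size fixed and pays only the cost of one transformation per presented pattern.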

