Bishop C.M., Pattern Recognition and Machine Learning (2006)
To see this, consider the contours of the likelihood function and the prior as illustrated in Figure 3.15. Here we have implicitly transformed to a rotated set of axes in parameter space aligned with the eigenvectors ui defined in (3.87). Contours of the likelihood function are then axis-aligned ellipses. The eigenvalues λi measure the curvature of the likelihood function, and so in Figure 3.15 the eigenvalue λ1 is small compared with λ2 (because a smaller curvature corresponds to a greater elongation of the contours of the likelihood function).
Because βΦᵀΦ is a positive definite matrix, it will have positive eigenvalues, and so the ratio λi/(λi + α) will lie between 0 and 1. Consequently, the quantity γ defined by (3.91) will lie in the range 0 ≤ γ ≤ M. For directions in which λi ≫ α, the corresponding parameter wi will be close to its maximum likelihood value, and the ratio λi/(λi + α) will be close to 1. Such parameters are called well determined because their values are tightly constrained by the data. Conversely, for directions in which λi ≪ α, the corresponding parameters wi will be close to zero, as will the ratios λi/(λi + α). These are directions in which the likelihood function is relatively insensitive to the parameter value and so the parameter has been set to a small value by the prior.
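As a concrete illustration (a minimal sketch, not taken from the text; the design matrix and the values of α and β are made up), the following Python fragment computes the eigenvalues λi of βΦᵀΦ and the ratios λi/(λi + α), whose sum gives γ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the data, alpha and beta below are illustrative values only.
N, M = 25, 10
Phi = rng.normal(size=(N, M))       # design matrix, rows are phi(x_n)^T
alpha, beta = 2.0, 11.1             # prior precision and noise precision

# Eigenvalues lambda_i of beta * Phi^T Phi (cf. (3.87))
lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)

ratios = lam / (lam + alpha)        # each ratio lies between 0 and 1
gamma = ratios.sum()                # effective number of well-determined parameters, (3.91)

print(np.round(ratios, 3))          # near 1 where lambda_i >> alpha, near 0 where lambda_i << alpha
print(gamma)                        # lies between 0 and M
```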
The quantity γ defined by (3.91) therefore measures the effective total number of well determined parameters.

We can obtain some insight into the result (3.95) for re-estimating β by comparing it with the corresponding maximum likelihood result given by (3.21). Both of these formulae express the variance (the inverse precision) as an average of the squared differences between the targets and the model predictions.
However, they differ in that the number of data points N in the denominator of the maximum likelihood result is replaced by N − γ in the Bayesian result. We recall from (1.56) that the maximum likelihood estimate of the variance for a Gaussian distribution over a single variable x is given by

$$
\sigma_{\mathrm{ML}}^{2} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^{2}
\tag{3.96}
$$

and that this estimate is biased because the maximum likelihood solution µML for the mean has fitted some of the noise on the data. In effect, this has used up one degree of freedom in the model. The corresponding unbiased estimate is given by (1.59) and takes the form

$$
\sigma_{\mathrm{MAP}}^{2} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^{2}.
\tag{3.97}
$$

We shall see in Section 10.1.3 that this result can be obtained from a Bayesian treatment in which we marginalize over the unknown mean. The factor of N − 1 in the denominator of the Bayesian result takes account of the fact that one degree of freedom has been used in fitting the mean and removes the bias of maximum likelihood.
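This bias is easy to check numerically. The sketch below (illustrative only; the variance, sample size, and number of trials are arbitrary choices) draws many small samples from a Gaussian and compares the averages of the estimates (3.96) and (3.97):

```python
import numpy as np

rng = np.random.default_rng(1)

true_var = 4.0
N = 5                      # a small sample makes the bias easy to see
trials = 200_000

x = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
mu_ml = x.mean(axis=1, keepdims=True)
sq = ((x - mu_ml) ** 2).sum(axis=1)

var_ml = sq / N            # maximum likelihood estimate (3.96), biased
var_unb = sq / (N - 1)     # unbiased estimate (3.97)

print(var_ml.mean())       # roughly (N - 1)/N * true_var = 3.2
print(var_unb.mean())      # roughly true_var = 4.0
```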
Now consider the corresponding results for the linear regression model. The mean of the target distribution is now given by the function wᵀφ(x), which contains M parameters. However, not all of these parameters are tuned to the data. The effective number of parameters that are determined by the data is γ, with the remaining M − γ parameters set to small values by the prior. This is reflected in the Bayesian result for the variance that has a factor N − γ in the denominator, thereby correcting for the bias of the maximum likelihood result.

We can illustrate the evidence framework for setting hyperparameters using the sinusoidal synthetic data set from Section 1.1, together with the Gaussian basis function model comprising 9 basis functions, so that the total number of parameters in the model is given by M = 10 including the bias. Here, for simplicity of illustration, we have set β to its true value of 11.1 and then used the evidence framework to determine α, as shown in Figure 3.16.
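A sketch of this experimental setup might look as follows; the basis-function centres, their width, and the sample size are assumptions made for illustration, since the text does not specify them here:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sinusoidal data in the spirit of Section 1.1; the sample size is an assumption.
N = 25
beta_true = 11.1                           # noise precision quoted in the text
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta_true), size=N)

# Gaussian basis function model: 9 basis functions plus a bias, so M = 10.
centres = np.linspace(0.0, 1.0, 9)         # assumed centres
s = 0.1                                    # assumed width

def design_matrix(x):
    """Rows are [1, phi_1(x_n), ..., phi_9(x_n)]."""
    gauss = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / s) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), gauss])

Phi = design_matrix(x)
print(Phi.shape)                           # (25, 10)
```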
We can also see how the parameter α controls the magnitude of the parameters {wi}, by plotting the individual parameters versus the effective number γ of parameters, as shown in Figure 3.17.

If we consider the limit N ≫ M in which the number of data points is large in relation to the number of parameters, then from (3.87) all of the parameters will be well determined by the data because ΦᵀΦ involves an implicit sum over data points, and so the eigenvalues λi increase with the size of the data set. In this case, γ = M, and the re-estimation equations for α and β become

$$
\alpha = \frac{M}{2 E_W(\mathbf{m}_N)}
\tag{3.98}
$$

$$
\beta = \frac{N}{2 E_D(\mathbf{m}_N)}
\tag{3.99}
$$

where EW and ED are defined by (3.25) and (3.26), respectively.
These results can be used as an easy-to-compute approximation to the full evidence re-estimation formulae, because they do not require evaluation of the eigenvalue spectrum of the Hessian.

Figure 3.16 The left plot shows γ (red curve) and 2αEW(mN) (blue curve) versus ln α for the sinusoidal synthetic data set. It is the intersection of these two curves that defines the optimum value for α given by the evidence procedure. The right plot shows the corresponding graph of log evidence ln p(t|α, β) versus ln α (red curve) showing that the peak coincides with the crossing point of the curves in the left plot. Also shown is the test set error (blue curve) showing that the evidence maximum occurs close to the point of best generalization.

Figure 3.17 Plot of the 10 parameters wi from the Gaussian basis function model versus the effective number of parameters γ, in which the hyperparameter α is varied in the range 0 ≤ α ≤ ∞ causing γ to vary in the range 0 ≤ γ ≤ M.
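To make the re-estimation procedure concrete, the following sketch (not code from the book; the data, starting values, and iteration count are arbitrary) alternates between computing the posterior mean mN, the quantity γ from (3.91), and the hyperparameter updates, using the full re-estimate of α as γ/(mNᵀmN) together with (3.95) for β, and then evaluates the large-N approximations (3.98) and (3.99) for comparison:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data and design matrix; all numerical choices here are assumptions.
N, M = 25, 10
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

alpha, beta = 1.0, 1.0                                   # arbitrary starting values

for _ in range(100):
    # Posterior mean for the current hyperparameters: A = alpha I + beta Phi^T Phi
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)

    # Effective number of well-determined parameters, (3.91)
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(lam / (lam + alpha))

    # Evidence re-estimates: alpha from gamma and m_N, beta from (3.95)
    alpha = gamma / (m_N @ m_N)
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)

# Large-N approximations (3.98), (3.99): no eigenvalue computation needed
E_W = 0.5 * m_N @ m_N
E_D = 0.5 * np.sum((t - Phi @ m_N) ** 2)
print(alpha, beta, M / (2 * E_W), N / (2 * E_D))
```

When N is not large compared with M, the two sets of values can differ noticeably, which is exactly the regime in which the eigenvalue-based updates matter.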
3.6. Limitations of Fixed Basis Functions

Throughout this chapter, we have focussed on models comprising a linear combination of fixed, nonlinear basis functions. We have seen that the assumption of linearity in the parameters led to a range of useful properties including closed-form solutions to the least-squares problem, as well as a tractable Bayesian treatment. Furthermore, for a suitable choice of basis functions, we can model arbitrary nonlinearities in the mapping from input variables to targets. In the next chapter, we shall study an analogous class of models for classification.

It might appear, therefore, that such linear models constitute a general purpose framework for solving problems in pattern recognition.
Unfortunately, there are some significant shortcomings with linear models, which will cause us to turn in later chapters to more complex models such as support vector machines and neural networks.

The difficulty stems from the assumption that the basis functions φj(x) are fixed before the training data set is observed and is a manifestation of the curse of dimensionality discussed in Section 1.4. As a consequence, the number of basis functions needs to grow rapidly, often exponentially, with the dimensionality D of the input space.

Fortunately, there are two properties of real data sets that we can exploit to help alleviate this problem. First of all, the data vectors {xn} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space as a result of strong correlations between the input variables. We will see an example of this when we consider images of handwritten digits in Chapter 12.
If we are using localized basis functions, we can arrange that they are scattered in input space only in regions containing data. This approach is used in radial basis function networks and also in support vector and relevance vector machines. Neural network models, which use adaptive basis functions having sigmoidal nonlinearities, can adapt the parameters so that the regions of input space over which the basis functions vary correspond to the data manifold.
The second property is that target variables may have significant dependence on only a small number of possible directions within the data manifold. Neural networks can exploit this property by choosing the directions in input space to which the basis functions respond.

Exercises

3.1 www Show that the 'tanh' function and the logistic sigmoid function (3.6) are related by

$$
\tanh(a) = 2\sigma(2a) - 1.
\tag{3.100}
$$

Hence show that a general linear combination of logistic sigmoid functions of the form

$$
y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j\, \sigma\!\left(\frac{x - \mu_j}{s}\right)
\tag{3.101}
$$

is equivalent to a linear combination of 'tanh' functions of the form

$$
y(x, \mathbf{u}) = u_0 + \sum_{j=1}^{M} u_j \tanh\!\left(\frac{x - \mu_j}{2s}\right)
\tag{3.102}
$$

and find expressions to relate the new parameters {u1, . . . , uM} to the original parameters {w1, . . . , wM}.
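A quick numerical check of the identity (3.100), offered only as an illustration (the grid of test points is arbitrary), could be:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 101)
# tanh(a) = 2*sigma(2a) - 1, equation (3.100)
assert np.allclose(np.tanh(a), 2.0 * sigmoid(2.0 * a) - 1.0)
print("identity (3.100) holds on the test grid")
```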
3.2 Show that the matrix

$$
\Phi(\Phi^{\mathrm{T}}\Phi)^{-1}\Phi^{\mathrm{T}}
\tag{3.103}
$$

takes any vector v and projects it onto the space spanned by the columns of Φ. Use this result to show that the least-squares solution (3.15) corresponds to an orthogonal projection of the vector t onto the manifold S as shown in Figure 3.2.

3.3 Consider a data set in which each data point tn is associated with a weighting factor rn > 0, so that the sum-of-squares error function becomes

$$
E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} r_n \left\{ t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^{2}.
\tag{3.104}
$$

Find an expression for the solution w⋆ that minimizes this error function. Give two alternative interpretations of the weighted sum-of-squares error function in terms of (i) data dependent noise variance and (ii) replicated data points.

3.4 www Consider a linear model of the form

$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i
\tag{3.105}
$$

together with a sum-of-squares error function of the form

$$
E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \right\}^{2}.
\tag{3.106}
$$

Now suppose that Gaussian noise εi with zero mean and variance σ² is added independently to each of the input variables xi.
By making use of E[εi] = 0 and E[εi εj] = δij σ², show that minimizing ED averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for noise-free input variables with the addition of a weight-decay regularization term, in which the bias parameter w0 is omitted from the regularizer.

3.5 www Using the technique of Lagrange multipliers, discussed in Appendix E, show that minimization of the regularized error function (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint (3.30). Discuss the relationship between the parameters η and λ.

3.6 www Consider a linear basis function regression model for a multivariate target variable t having a Gaussian distribution of the form

$$
p(\mathbf{t}\mid\mathbf{W}, \boldsymbol{\Sigma}) = \mathcal{N}\big(\mathbf{t}\mid\mathbf{y}(\mathbf{x}, \mathbf{W}), \boldsymbol{\Sigma}\big)
\tag{3.107}
$$

where

$$
\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})
\tag{3.108}
$$

together with a training data set comprising input basis vectors φ(xn) and corresponding target vectors tn, with n = 1, . . . , N.