Bishop C.M., Pattern Recognition and Machine Learning (2006)
We can obtain a smoother density model if we choose a smoother kernel function, and a common choice is the Gaussian, which gives rise to the following kernel density model

p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}    (2.250)

where h represents the standard deviation of the Gaussian components.
Thus our density model is obtained by placing a Gaussian over each data point and then adding up the contributions over the whole data set, and then dividing by N so that the density is correctly normalized. In Figure 2.25, we apply the model (2.250) to the data set used earlier to demonstrate the histogram technique. We see that, as expected, the parameter h plays the role of a smoothing parameter, and there is a trade-off between sensitivity to noise at small h and over-smoothing at large h. Again, the optimization of h is a problem in model complexity, analogous to the choice of bin width in histogram density estimation, or the degree of the polynomial used in curve fitting.

Figure 2.25  Illustration of the kernel density model (2.250) applied to the same data set used to demonstrate the histogram approach in Figure 2.24, for h = 0.005 (top panel), h = 0.07 (middle panel), and h = 0.2 (bottom panel). We see that h acts as a smoothing parameter and that if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), then the bimodal nature of the underlying distribution from which the data is generated (shown by the green curve) is washed out. The best density model is obtained for some intermediate value of h (middle panel).

We can choose any other kernel function k(u) in (2.249) subject to the conditions

k(u) \geq 0,    (2.251)

\int k(u) \, du = 1    (2.252)

which ensure that the resulting probability distribution is nonnegative everywhere and integrates to one. The class of density model given by (2.249) is called a kernel density estimator, or Parzen estimator.
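To make the estimator concrete, here is a minimal sketch of the Gaussian kernel density model (2.250) in Python, assuming NumPy is available; the data set and bandwidth below are illustrative placeholders, not the data of Figure 2.25.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Evaluate the kernel density model (2.250) at the points x.

    x    : (M, D) array of evaluation points
    data : (N, D) array of training points x_1, ..., x_N
    h    : standard deviation of the Gaussian components (smoothing parameter)
    """
    N, D = data.shape
    # Squared Euclidean distances ||x - x_n||^2, shape (M, N)
    sq_dist = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h ** 2) ** (D / 2)
    # Place a Gaussian on each data point, sum over n, and divide by N
    return np.exp(-sq_dist / (2.0 * h ** 2)).sum(axis=1) / (N * norm)

# Illustrative use: bimodal 1-D data evaluated on a grid
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.25, 0.05, 30),
                       rng.normal(0.75, 0.05, 30)])[:, None]
grid = np.linspace(0, 1, 200)[:, None]
density = gaussian_kde(grid, data, h=0.07)
```

Note that each evaluation of the density touches every stored data point, which is the linear cost discussed next.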
The kernel density estimator has the great merit that there is no computation involved in the ‘training’ phase, because this simply requires storage of the training set. However, this is also one of its great weaknesses, because the computational cost of evaluating the density grows linearly with the size of the data set.

2.5.2 Nearest-neighbour methods

One of the difficulties with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels.
In regions of high data density, a large value of h may lead to over-smoothing and a washing out of structure that might otherwise be extracted from the data. However, reducing h may lead to noisy estimates elsewhere in data space where the density is smaller. Thus the optimal choice for h may be dependent on location within the data space. This issue is addressed by nearest-neighbour methods for density estimation.

We therefore return to our general result (2.246) for local density estimation, and instead of fixing V and determining the value of K from the data, we consider a fixed value of K and use the data to find an appropriate value for V. To do this, we consider a small sphere centred on the point x at which we wish to estimate the density p(x), and we allow the radius of the sphere to grow until it contains precisely K data points. The estimate of the density p(x) is then given by (2.246) with V set to the volume of the resulting sphere.

Figure 2.26  Illustration of K-nearest-neighbour density estimation using the same data set as in Figures 2.25 and 2.24, for K = 1 (top panel), K = 5 (middle panel), and K = 30 (bottom panel). We see that the parameter K governs the degree of smoothing, so that a small value of K leads to a very noisy density model (top panel), whereas a large value (bottom panel) smoothes out the bimodal nature of the true distribution (shown by the green curve) from which the data set was generated.
This technique is known as K nearest neighbours and is illustrated in Figure 2.26, for various choices of the parameter K, using the same data set as in Figures 2.24 and 2.25. We see that the value of K now governs the degree of smoothing and that again there is an optimum choice for K that is neither too large nor too small. Note that the model produced by K nearest neighbours is not a true density model because the integral over all space diverges (Exercise 2.61).
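A minimal sketch of this K-nearest-neighbour density estimate, assuming NumPy and SciPy with illustrative function and variable names, sets V in (2.246) to the volume of the D-dimensional ball whose radius is the distance from x to its K-th nearest data point.

```python
import numpy as np
from scipy.special import gammaln

def knn_density(x, data, K):
    """K-nearest-neighbour density estimate p(x) = K / (N V), following (2.246).

    x    : (M, D) array of query points
    data : (N, D) array of training points
    K    : number of data points the sphere must contain
    """
    N, D = data.shape
    # Distance from each query point to every data point, shape (M, N)
    dist = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)
    # Radius of the smallest sphere around x containing exactly K points
    r_k = np.sort(dist, axis=1)[:, K - 1]
    # Volume of a D-dimensional ball of radius r_k
    log_unit_ball = (D / 2) * np.log(np.pi) - gammaln(D / 2 + 1)
    V = np.exp(log_unit_ball) * r_k ** D
    return K / (N * V)
```

Far from the data the radius r_k grows like the distance to the data set, so the estimate decays only as that distance to the power −D, and its integral over all of R^D diverges, in line with the remark above.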
We close this chapter by showing how the K-nearest-neighbour technique for density estimation can be extended to the problem of classification. To do this, we apply the K-nearest-neighbour density estimation technique to each class separately and then make use of Bayes’ theorem. Let us suppose that we have a data set comprising N_k points in class C_k with N points in total, so that \sum_k N_k = N. If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points irrespective of their class.
Suppose this sphere has volume V and contains K_k points from class C_k. Then (2.246) provides an estimate of the density associated with each class

p(x|C_k) = \frac{K_k}{N_k V}.    (2.253)

Similarly, the unconditional density is given by

p(x) = \frac{K}{N V}    (2.254)

while the class priors are given by

p(C_k) = \frac{N_k}{N}.    (2.255)

We can now combine (2.253), (2.254), and (2.255) using Bayes’ theorem to obtain the posterior probability of class membership

p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}.    (2.256)

Figure 2.27  (a) In the K-nearest-neighbour classifier, a new point, shown by the black diamond, is classified according to the majority class membership of the K closest training data points, in this case K = 3. (b) In the nearest-neighbour (K = 1) approach to classification, the resulting decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of points from different classes.

If we wish to minimize the probability of misclassification, this is done by assigning the test point x to the class having the largest posterior probability, corresponding to the largest value of K_k/K.
Thus to classify a new point, we identify the K nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set. Ties can be broken at random. The particular case of K = 1 is called the nearest-neighbour rule, because a test point is simply assigned to the same class as the nearest point from the training set. These concepts are illustrated in Figure 2.27.
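A minimal sketch of this decision rule, assuming NumPy, brute-force distance search, and illustrative names, assigns each query point to the class with the most representatives among its K nearest training points, i.e. the largest K_k/K in (2.256).

```python
import numpy as np

def knn_classify(x, data, labels, K):
    """Assign each query point to the class with the most representatives
    among its K nearest training points (the largest K_k / K in (2.256)).

    x      : (M, D) array of query points
    data   : (N, D) array of training points
    labels : (N,)  array of integer class labels for the training points
    K      : number of neighbours to consult
    """
    dist = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)  # (M, N)
    nearest = np.argsort(dist, axis=1)[:, :K]                         # K nearest indices
    predictions = []
    for row in labels[nearest]:                                       # (K,) labels per query
        counts = np.bincount(row)
        winners = np.flatnonzero(counts == counts.max())
        predictions.append(np.random.choice(winners))                 # break ties at random
    return np.array(predictions)
```

For large training sets the brute-force search is the costly step; tree-based structures such as k-d trees (for instance scipy.spatial.cKDTree) can return (approximate) near neighbours far more cheaply, as discussed at the end of this section.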
In Figure 2.28, we show the results of applying the K-nearest-neighbour algorithm to the oil flow data, introduced in Chapter 1, for various values of K. As expected, we see that K controls the degree of smoothing, so that small K produces many small regions of each class, whereas large K leads to fewer larger regions.

Figure 2.28  Plot of 200 data points from the oil data set showing values of x6 plotted against x7, where the red, green, and blue points correspond to the ‘laminar’, ‘annular’, and ‘homogeneous’ classes, respectively. Also shown are the classifications of the input space given by the K-nearest-neighbour algorithm for various values of K (K = 1, 3, and 31 in the three panels).

An interesting property of the nearest-neighbour (K = 1) classifier is that, in the limit N → ∞, the error rate is never more than twice the minimum achievable error rate of an optimal classifier, i.e., one that uses the true class distributions (Cover and Hart, 1967).

As discussed so far, both the K-nearest-neighbour method and the kernel density estimator require the entire training data set to be stored, leading to expensive computation if the data set is large.
This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set. Nevertheless, these nonparametric methods are still severely limited. On the other hand, we have seen that simple parametric models are very restricted in terms of the forms of distribution that they can represent. We therefore need to find density models that are very flexible and yet for which the complexity of the models can be controlled independently of the size of the training set, and we shall see in subsequent chapters how to achieve this.

Exercises

2.1 ( ) www  Verify that the Bernoulli distribution (2.2) satisfies the following properties

\sum_{x=0}^{1} p(x|\mu) = 1    (2.257)

E[x] = \mu    (2.258)

var[x] = \mu(1 - \mu).    (2.259)

Show that the entropy H[x] of a Bernoulli distributed random binary variable x is given by

H[x] = -\mu \ln \mu - (1 - \mu) \ln(1 - \mu).    (2.260)
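A quick numerical check of the quantities in (2.258)–(2.260), assuming NumPy and an illustrative value µ = 0.3, can be run as follows; the Monte Carlo estimates should agree with the analytic expressions to within sampling error.

```python
import numpy as np

mu = 0.3                                   # illustrative parameter value
rng = np.random.default_rng(0)
x = rng.binomial(1, mu, size=1_000_000)    # samples of a Bernoulli(mu) variable

# Analytic results (2.258)-(2.260), entropy in nats
mean = mu
var = mu * (1 - mu)
entropy = -mu * np.log(mu) - (1 - mu) * np.log(1 - mu)

print(x.mean(), mean)                      # both close to 0.3
print(x.var(), var)                        # both close to 0.21
p_hat = np.array([1 - x.mean(), x.mean()]) # empirical class frequencies
print(-(p_hat * np.log(p_hat)).sum(), entropy)
```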
2.2 ( )  The form of the Bernoulli distribution given by (2.2) is not symmetric between the two values of x. In some situations, it will be more convenient to use an equivalent formulation for which x ∈ {−1, 1}, in which case the distribution can be written

p(x|\mu) = \left(\frac{1-\mu}{2}\right)^{(1-x)/2} \left(\frac{1+\mu}{2}\right)^{(1+x)/2}    (2.261)

where µ ∈ [−1, 1]. Show that the distribution (2.261) is normalized, and evaluate its mean, variance, and entropy.

2.3 ( ) www  In this exercise, we prove that the binomial distribution (2.9) is normalized.
First use the definition (2.10) of the number of combinations of m identical objects chosen from a total of N to show that

\binom{N}{m} + \binom{N}{m-1} = \binom{N+1}{m}.    (2.262)

Use this result to prove by induction the following result

(1+x)^N = \sum_{m=0}^{N} \binom{N}{m} x^m    (2.263)

which is known as the binomial theorem, and which is valid for all real values of x. Finally, show that the binomial distribution is normalized, so that

\sum_{m=0}^{N} \binom{N}{m} \mu^m (1-\mu)^{N-m} = 1    (2.264)

which can be done by first pulling a factor (1 − µ)^N out of the summation and then making use of the binomial theorem.

2.4 ( )  Show that the mean of the binomial distribution is given by (2.11).
To do this, differentiate both sides of the normalization condition (2.264) with respect to µ and then rearrange to obtain an expression for the mean of n. Similarly, by differentiating (2.264) twice with respect to µ and making use of the result (2.11) for the mean of the binomial distribution, prove the result (2.12) for the variance of the binomial.

2.5 ( ) www  In this exercise, we prove that the beta distribution, given by (2.13), is correctly normalized, so that (2.14) holds. This is equivalent to showing that

\int_0^1 \mu^{a-1} (1-\mu)^{b-1} \, d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.    (2.265)

From the definition (1.141) of the gamma function, we have

\Gamma(a)\Gamma(b) = \int_0^\infty \exp(-x)\, x^{a-1} \, dx \int_0^\infty \exp(-y)\, y^{b-1} \, dy.    (2.266)

Use this expression to prove (2.265) as follows.