Bishop C.M., Pattern Recognition and Machine Learning (2006)

2. PROBABILITY DISTRIBUTIONS
Suppose we wish to model the probability distribution p(x) of a random variable x, given a finite set x1, . . . , xN of observations. This problem is known as density estimation. For the purposes of this chapter, we shall assume that the data points are independent and identically distributed. It should be emphasized that the problem of density estimation is fundamentally ill-posed, because there are infinitely many probability distributions that could have given rise to the observed finite data set.
Indeed, any distribution p(x) that is nonzero at each of the data points x1, . . . , xN is a potential candidate. The issue of choosing an appropriate distribution relates to the problem of model selection, which was already encountered in the context of polynomial curve fitting in Chapter 1 and which is a central issue in pattern recognition.

We begin by considering the binomial and multinomial distributions for discrete random variables and the Gaussian distribution for continuous random variables. These are specific examples of parametric distributions, so-called because they are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian. To apply such models to the problem of density estimation, we need a procedure for determining suitable values for the parameters, given an observed data set.
In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function. By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.

We shall see that an important role is played by conjugate priors, which lead to posterior distributions having the same functional form as the prior, and which therefore greatly simplify the Bayesian analysis.
For example, the conjugate prior for the parameters of the multinomial distribution is called the Dirichlet distribution, while the conjugate prior for the mean of a Gaussian is another Gaussian. All of these distributions are examples of the exponential family of distributions, which possess a number of important properties and which will be discussed in some detail.

One limitation of the parametric approach is that it assumes a specific functional form for the distribution, which may turn out to be inappropriate for a particular application. An alternative approach is given by nonparametric density estimation methods, in which the form of the distribution typically depends on the size of the data set.
Such models still contain parameters, but these control the model complexity rather than the form of the distribution. We end this chapter by considering three nonparametric methods based respectively on histograms, nearest neighbours, and kernels.

2.1. Binary Variables

We begin by considering a single binary random variable x ∈ {0, 1}. For example, x might describe the outcome of flipping a coin, with x = 1 representing 'heads' and x = 0 representing 'tails'.
We can imagine that this is a damaged coin, so that the probability of landing heads is not necessarily the same as that of landing tails. The probability of x = 1 will be denoted by the parameter µ, so that

    p(x = 1|µ) = µ                                            (2.1)

where 0 ≤ µ ≤ 1, from which it follows that p(x = 0|µ) = 1 − µ. The probability distribution over x can therefore be written in the form

    Bern(x|µ) = µ^x (1 − µ)^{1−x}                             (2.2)

which is known as the Bernoulli distribution. It is easily verified (Exercise 2.1) that this distribution is normalized and that it has mean and variance given by

    E[x] = µ                                                  (2.3)
    var[x] = µ(1 − µ).                                        (2.4)

Now suppose we have a data set D = {x1, . . . , xN} of observed values of x.
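As a quick numerical check of (2.2)–(2.4), the short Python sketch below (an illustrative addition, not part of the text; the function name is ours) enumerates the two outcomes x ∈ {0, 1} and confirms the normalization, mean, and variance of the Bernoulli distribution.

```python
# Minimal sketch: numerically verify (2.2)-(2.4) for Bern(x|mu) = mu^x (1 - mu)^(1 - x).

def bernoulli_pmf(x, mu):
    """Probability of x in {0, 1} under Bern(x|mu), equation (2.2)."""
    return mu**x * (1 - mu)**(1 - x)

mu = 0.25
probs = {x: bernoulli_pmf(x, mu) for x in (0, 1)}

total = sum(probs.values())                              # should be 1 (normalization)
mean = sum(x * p for x, p in probs.items())              # should equal mu, eq. (2.3)
var = sum((x - mean)**2 * p for x, p in probs.items())   # should equal mu*(1 - mu), eq. (2.4)

print(total, mean, var)   # -> 1.0 0.25 0.1875
```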
We can construct the likelihood function, which is a function of µ, on the assumption that the observations are drawn independently from p(x|µ), so that

    p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1 − µ)^{1−x_n}.        (2.5)

In a frequentist setting, we can estimate a value for µ by maximizing the likelihood function, or equivalently by maximizing the logarithm of the likelihood. In the case of the Bernoulli distribution, the log likelihood function is given by

    ln p(D|µ) = ∑_{n=1}^{N} ln p(x_n|µ) = ∑_{n=1}^{N} {x_n ln µ + (1 − x_n) ln(1 − µ)}.   (2.6)

At this point, it is worth noting that the log likelihood function depends on the N observations x_n only through their sum ∑_n x_n. This sum provides an example of a sufficient statistic for the data under this distribution, and we shall study the important role of sufficient statistics in some detail in Section 2.4. If we set the derivative of ln p(D|µ) with respect to µ equal to zero, we obtain the maximum likelihood estimator

    µ_ML = (1/N) ∑_{n=1}^{N} x_n                              (2.7)

which is also known as the sample mean. If we denote the number of observations of x = 1 (heads) within this data set by m, then we can write (2.7) in the form

    µ_ML = m/N                                                (2.8)

so that the probability of landing heads is given, in this maximum likelihood framework, by the fraction of observations of heads in the data set.

[Biographical note: Jacob Bernoulli (1654–1705), also known as Jacques or James Bernoulli, was a Swiss mathematician and was the first of many in the Bernoulli family to pursue a career in science and mathematics. Although compelled to study philosophy and theology against his will by his parents, he travelled extensively after graduating in order to meet with many of the leading scientists of his time, including Boyle and Hooke in England. When he returned to Switzerland, he taught mechanics and became Professor of Mathematics at Basel in 1687. Unfortunately, rivalry between Jacob and his younger brother Johann turned an initially productive collaboration into a bitter and public dispute. Jacob's most significant contributions to mathematics appeared in The Art of Conjecture, published in 1713, eight years after his death, which deals with topics in probability theory including what has become known as the Bernoulli distribution.]
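Equations (2.5)–(2.8) translate directly into a few lines of code. The sketch below is an illustrative addition (not from the book; the data set is hypothetical): it computes µ_ML as the fraction of heads, using only the sufficient statistic ∑_n x_n.

```python
# Minimal sketch: maximum likelihood estimate of mu for a Bernoulli data set,
# equations (2.7)-(2.8). The log likelihood (2.6) depends on the data only
# through the sufficient statistic sum(x_n).

def mu_ml(data):
    """Return the maximum likelihood estimate mu_ML = m / N, equation (2.8)."""
    m = sum(data)            # number of observations with x = 1 ('heads')
    N = len(data)
    return m / N

data = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]    # hypothetical sequence of coin flips
print(mu_ml(data))                        # -> 0.4, the sample mean (2.7)
```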
Now suppose we flip a coin, say, 3 times and happen to observe 3 heads. Then N = m = 3 and µ_ML = 1. In this case, the maximum likelihood result would predict that all future observations should give heads. Common sense tells us that this is unreasonable, and in fact this is an extreme example of the over-fitting associated with maximum likelihood. We shall see shortly how to arrive at more sensible conclusions through the introduction of a prior distribution over µ.

We can also work out the distribution of the number m of observations of x = 1, given that the data set has size N. This is called the binomial distribution, and from (2.5) we see that it is proportional to µ^m (1 − µ)^{N−m}. In order to obtain the normalization coefficient we note that, out of N coin flips, we have to add up all of the possible ways of obtaining m heads, so that the binomial distribution can be written

    Bin(m|N, µ) = \binom{N}{m} µ^m (1 − µ)^{N−m}              (2.9)

where

    \binom{N}{m} ≡ N! / ((N − m)! m!)                         (2.10)

is the number of ways of choosing m objects out of a total of N identical objects (Exercise 2.3).

[Figure 2.1: histogram plot of the binomial distribution (2.9) as a function of m, for N = 10 and µ = 0.25.]

Figure 2.1 shows a plot of the binomial distribution for N = 10 and µ = 0.25. The mean and variance of the binomial distribution can be found by using the result of Exercise 1.10, which shows that for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances. Because m = x_1 + . . . + x_N, and for each observation the mean and variance are given by (2.3) and (2.4), respectively, we have

    E[m] ≡ ∑_{m=0}^{N} m Bin(m|N, µ) = Nµ                     (2.11)

    var[m] ≡ ∑_{m=0}^{N} (m − E[m])^2 Bin(m|N, µ) = Nµ(1 − µ).   (2.12)

These results can also be proved directly using calculus (Exercise 2.4).
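The settings N = 10 and µ = 0.25 from Figure 2.1 can be used to check (2.9)–(2.12) numerically. The sketch below is our own illustration (not from the book) and uses only the Python standard library.

```python
# Minimal sketch: the binomial distribution (2.9)-(2.10) and a numerical check
# of its mean and variance, equations (2.11)-(2.12).
from math import comb

def binomial_pmf(m, N, mu):
    """Bin(m|N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m), equation (2.9)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

mean = sum(m * p for m, p in enumerate(pmf))               # N * mu = 2.5, eq. (2.11)
var = sum((m - mean)**2 * p for m, p in enumerate(pmf))    # N * mu * (1 - mu) = 1.875, eq. (2.12)
print(round(mean, 6), round(var, 6))
```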
2.1.1 The beta distribution

We have seen in (2.8) that the maximum likelihood setting for the parameter µ in the Bernoulli distribution, and hence in the binomial distribution, is given by the fraction of the observations in the data set having x = 1. As we have already noted, this can give severely over-fitted results for small data sets. In order to develop a Bayesian treatment for this problem, we need to introduce a prior distribution p(µ) over the parameter µ. Here we consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties. To motivate this prior, we note that the likelihood function takes the form of a product of factors of the form µ^x (1 − µ)^{1−x}. If we choose a prior proportional to powers of µ and (1 − µ), then the posterior distribution, which is proportional to the product of the prior and the likelihood function, will have the same functional form as the prior. This property is called conjugacy, and we will see several examples of it later in this chapter. We therefore choose a prior, called the beta distribution, given by

    Beta(µ|a, b) = (Γ(a + b) / (Γ(a)Γ(b))) µ^{a−1} (1 − µ)^{b−1}    (2.13)

where Γ(x) is the gamma function defined by (1.141), and the coefficient in (2.13) ensures that the beta distribution is normalized (Exercise 2.5), so that

    ∫_0^1 Beta(µ|a, b) dµ = 1.                                (2.14)

The mean and variance of the beta distribution are given by (Exercise 2.6)

    E[µ] = a / (a + b)                                        (2.15)

    var[µ] = ab / ((a + b)^2 (a + b + 1)).                    (2.16)

The parameters a and b are often called hyperparameters because they control the distribution of the parameter µ.
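A small sketch of our own (not part of the text), using the standard-library gamma function, implements (2.13) and checks the normalization (2.14) and the mean (2.15) numerically for one hyperparameter setting.

```python
# Minimal sketch: the beta density (2.13) via the gamma function, with a crude
# midpoint-rule check of the normalization (2.14) and the mean (2.15).
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu|a, b) from equation (2.13)."""
    coef = gamma(a + b) / (gamma(a) * gamma(b))
    return coef * mu ** (a - 1) * (1 - mu) ** (b - 1)

a, b = 2.0, 3.0
n = 100_000                                    # midpoint grid on (0, 1)
grid = [(i + 0.5) / n for i in range(n)]

total = sum(beta_pdf(mu, a, b) for mu in grid) / n        # ~1.0, eq. (2.14)
mean = sum(mu * beta_pdf(mu, a, b) for mu in grid) / n    # ~a/(a+b) = 0.4, eq. (2.15)
print(round(total, 4), round(mean, 4))
```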
Figure 2.2 shows plots of the beta distribution for various values of the hyperparameters.

[Figure 2.2: plots of the beta distribution Beta(µ|a, b) given by (2.13) as a function of µ, for the hyperparameter settings (a = 0.1, b = 0.1), (a = 1, b = 1), (a = 2, b = 3), and (a = 8, b = 4).]

The posterior distribution of µ is now obtained by multiplying the beta prior (2.13) by the binomial likelihood function (2.9) and normalizing. Keeping only the factors that depend on µ, we see that this posterior distribution has the form

    p(µ|m, l, a, b) ∝ µ^{m+a−1} (1 − µ)^{l+b−1}               (2.17)

where l = N − m, and therefore corresponds to the number of 'tails' in the coin example. We see that (2.17) has the same functional dependence on µ as the prior distribution, reflecting the conjugacy properties of the prior with respect to the likelihood function.
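Comparing (2.17) with (2.13) shows that the posterior is again a beta distribution, with parameters a + m and b + l. The sketch below is an illustrative addition (not from the book; the prior hyperparameters are hypothetical): it performs this conjugate update and reports the posterior mean using (2.15).

```python
# Minimal sketch: conjugate update of a beta prior by the binomial likelihood.
# By comparison of (2.17) with (2.13), the posterior is Beta(mu | a + m, b + l),
# where m is the number of heads and l = N - m the number of tails.

def beta_binomial_update(a, b, data):
    """Return posterior hyperparameters (a + m, b + l) for binary data."""
    m = sum(data)            # number of heads (x = 1)
    l = len(data) - m        # number of tails, l = N - m
    return a + m, b + l

a, b = 2.0, 2.0              # hypothetical prior hyperparameters
data = [1, 1, 1]             # the three-heads data set from the text
a_post, b_post = beta_binomial_update(a, b, data)

# Posterior mean from (2.15): pulled back from the maximum likelihood extreme mu_ML = 1.
print(a_post, b_post, a_post / (a_post + b_post))   # -> 5.0 2.0 ~0.714
```

For the three-heads data set discussed earlier, the posterior mean is less extreme than µ_ML = 1, illustrating how the introduction of a prior over µ leads to more sensible conclusions for small data sets.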