Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), страница 17

Файл №811375 Bishop C.M. Pattern Recognition and Machine Learning (2006) (Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf) 17 страницаBishop C.M. Pattern Recognition and Machine Learning (2006) (811375) страница 172020-08-252020-08-25СтудИзба

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 17)

We now omitthe second term − ln ∆ on the right-hand side of (1.102) and then consider the limit1.6. Information Theory53∆ → 0. The ﬁrst term on the right-hand side of (1.102) will approach the integral ofp(x) ln p(x) in this limit so thatlimp(xi )∆ ln p(xi ) = − p(x) ln p(x) dx(1.103)∆→0iwhere the quantity on the right-hand side is called the differential entropy. We seethat the discrete and continuous forms of the entropy differ by a quantity ln ∆, whichdiverges in the limit ∆ → 0. This reﬂects the fact that to specify a continuousvariable very precisely requires a large number of bits.

For a density deﬁned overmultiple continuous variables, denoted collectively by the vector x, the differentialentropy is given byH[x] = − p(x) ln p(x) dx.(1.104)In the case of discrete distributions, we saw that the maximum entropy conﬁguration corresponded to an equal distribution of probabilities across the possiblestates of the variable. Let us now consider the maximum entropy conﬁguration fora continuous variable.

In order for this maximum to be well deﬁned, it will be necessary to constrain the ﬁrst and second moments of p(x) as well as preserving thenormalization constraint. We therefore maximize the differential entropy with theLudwig Boltzmann1844–1906Ludwig Eduard Boltzmann was anAustrian physicist who created theﬁeld of statistical mechanics.

Priorto Boltzmann, the concept of entropy was already known fromclassical thermodynamics where itquantiﬁes the fact that when we take energy from asystem, not all of that energy is typically availableto do useful work. Boltzmann showed that the thermodynamic entropy S, a macroscopic quantity, couldbe related to the statistical properties at the microscopic level. This is expressed through the famousequation S = k ln W in which W represents thenumber of possible microstates in a macrostate, andk 1.38 × 10−23 (in units of Joules per Kelvin) isknown as Boltzmann’s constant.

Boltzmann’s ideaswere disputed by many scientists of they day. One difﬁculty they saw arose from the second law of thermo-dynamics, which states that the entropy of a closedsystem tends to increase with time. By contrast, atthe microscopic level the classical Newtonian equations of physics are reversible, and so they found itdifﬁcult to see how the latter could explain the former. They didn’t fully appreciate Boltzmann’s arguments, which were statistical in nature and which concluded not that entropy could never decrease overtime but simply that with overwhelming probability itwould generally increase. Boltzmann even had a longrunning dispute with the editor of the leading Germanphysics journal who refused to let him refer to atomsand molecules as anything other than convenient theoretical constructs.

The continued attacks on his worklead to bouts of depression, and eventually he committed suicide. Shortly after Boltzmann’s death, newexperiments by Perrin on colloidal suspensions veriﬁed his theories and conﬁrmed the value of the Boltzmann constant. The equation S = k ln W is carved onBoltzmann’s tombstone.541. INTRODUCTIONthree constraints∞−∞Appendix EExercise 1.34Exercise 1.35p(x) dx = 1(1.105)xp(x) dx = µ(1.106)−∞∞−∞(x − µ)2 p(x) dx = σ 2 .(1.107)The constrained maximization can be performed using Lagrange multipliers so thatwe maximize the following functional with respect to p(x) ∞ ∞−p(x) ln p(x) dx + λ1p(x) dx − 1−∞−∞ ∞ ∞22+λ2xp(x) dx − µ + λ3(x − µ) p(x) dx − σ .−∞Appendix D∞−∞Using the calculus of variations, we set the derivative of this functional to zero giving(1.108)p(x) = exp −1 + λ1 + λ2 x + λ3 (x − µ)2 .The Lagrange multipliers can be found by back substitution of this result into thethree constraint equations, leading ﬁnally to the result(x − µ)21exp −(1.109)p(x) =2σ 2(2πσ 2 )1/2and so the distribution that maximizes the differential entropy is the Gaussian.

Notethat we did not constrain the distribution to be nonnegative when we maximized theentropy. However, because the resulting distribution is indeed nonnegative, we seewith hindsight that such a constraint is not necessary.If we evaluate the differential entropy of the Gaussian, we obtainH[x] =11 + ln(2πσ 2 ) .2(1.110)Thus we see again that the entropy increases as the distribution becomes broader,i.e., as σ 2 increases. This result also shows that the differential entropy, unlike thediscrete entropy, can be negative, because H(x) < 0 in (1.110) for σ 2 < 1/(2πe).Suppose we have a joint distribution p(x, y) from which we draw pairs of valuesof x and y. If a value of x is already known, then the additional information neededto specify the corresponding value of y is given by − ln p(y|x).

Thus the averageadditional information needed to specify y can be written asH[y|x] = −p(y, x) ln p(y|x) dy dx(1.111)1.6. Information TheoryExercise 1.3755which is called the conditional entropy of y given x. It is easily seen, using theproduct rule, that the conditional entropy satisﬁes the relationH[x, y] = H[y|x] + H[x](1.112)where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential entropy of the marginal distribution p(x). Thus the information needed to describe xand y is given by the sum of the information needed to describe x alone plus theadditional information required to specify y given x.1.6.1 Relative entropy and mutual informationSo far in this section, we have introduced a number of concepts from informationtheory, including the key notion of entropy. We now start to relate these ideas topattern recognition. Consider some unknown distribution p(x), and suppose thatwe have modelled this using an approximating distribution q(x).

If we use q(x) toconstruct a coding scheme for the purpose of transmitting values of x to a receiver,then the average additional amount of information (in nats) required to specify thevalue of x (assuming we choose an efﬁcient coding scheme) as a result of using q(x)instead of the true distribution p(x) is given by KL(pq) = − p(x) ln q(x) dx − − p(x) ln p(x) dxq(x)= − p(x) lndx.(1.113)p(x)This is known as the relative entropy or Kullback-Leibler divergence, or KL divergence (Kullback and Leibler, 1951), between the distributions p(x) and q(x).

Notethat it is not a symmetrical quantity, that is to say KL(pq) ≡ KL(qp).We now show that the Kullback-Leibler divergence satisﬁes KL(pq) 0 withequality if, and only if, p(x) = q(x). To do this we ﬁrst introduce the concept ofconvex functions. A function f (x) is said to be convex if it has the property thatevery chord lies on or above the function, as shown in Figure 1.31.

Any value of xin the interval from x = a to x = b can be written in the form λa + (1 − λ)b where0 λ 1. The corresponding point on the chord is given by λf (a) + (1 − λ)f (b),Claude Shannon1916–2001After graduating from Michigan andMIT, Shannon joined the AT&T BellTelephone laboratories in 1941. Hispaper ‘A Mathematical Theory ofCommunication’ published in theBell System Technical Journal in1948 laid the foundations for modern information the-ory. This paper introduced the word ‘bit’, and his concept that information could be sent as a stream of 1sand 0s paved the way for the communications revolution. It is said that von Neumann recommended toShannon that he use the term entropy, not only because of its similarity to the quantity used in physics,but also because “nobody knows what entropy reallyis, so in any discussion you will always have an advantage”.561.

INTRODUCTIONFigure 1.31A convex function f (x) is one for which every chord (shown in blue) lies on or abovethe function (shown in red).f (x)chordaxλbxand the corresponding value of the function is f (λa + (1 − λ)b). Convexity thenimpliesf (λa + (1 − λ)b) λf (a) + (1 − λ)f (b).(1.114)Exercise 1.36Exercise 1.38This is equivalent to the requirement that the second derivative of the function beeverywhere positive. Examples of convex functions are x ln x (for x > 0) and x2 . Afunction is called strictly convex if the equality is satisﬁed only for λ = 0 and λ = 1.If a function has the opposite property, namely that every chord lies on or below thefunction, it is called concave, with a corresponding deﬁnition for strictly concave. Ifa function f (x) is convex, then −f (x) will be concave.Using the technique of proof by induction, we can show from (1.114) that aconvex function f (x) satisﬁesMMλi xi λi f (xi )(1.115)fi=1i=1where λi 0 and i λi = 1, for any set of points {xi }.

The result (1.115) isknown as Jensen’s inequality. If we interpret the λi as the probability distributionover a discrete variable x taking the values {xi }, then (1.115) can be writtenf (E[x]) E[f (x)](1.116)where E[·] denotes the expectation. For continuous variables, Jensen’s inequalitytakes the form xp(x) dx f (x)p(x) dx.(1.117)fWe can apply Jensen’s inequality in the form (1.117) to the Kullback-Leiblerdivergence (1.113) to giveq(x)dx − ln q(x) dx = 0(1.118)KL(pq) = − p(x) lnp(x)1.6. Information Theory57where we have used the fact that − ln x is a convex function, together with the normalization condition q(x) dx = 1.

In fact, − ln x is a strictly convex function,so the equality will hold if, and only if, q(x) = p(x) for all x. Thus we can interpret the Kullback-Leibler divergence as a measure of the dissimilarity of the twodistributions p(x) and q(x).We see that there is an intimate relationship between data compression and density estimation (i.e., the problem of modelling an unknown probability distribution)because the most efﬁcient compression is achieved when we know the true distribution. If we use a distribution that is different from the true one, then we mustnecessarily have a less efﬁcient coding, and on average the additional informationthat must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions.Suppose that data is being generated from an unknown distribution p(x) that wewish to model.

Характеристики

Тип файла

PDF-файл

Размер

9,37 Mb

Материал

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Тип материала

Книга

Предмет

(ММО) Методы машинного обучения

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

bishop-c.m.-pattern-recognition-and-machine-learning-2006.pdf.rar

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.