Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), страница 16

Файл №811375 Bishop C.M. Pattern Recognition and Machine Learning (2006) (Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf) 16 страницаBishop C.M. Pattern Recognition and Machine Learning (2006) (811375) страница 162020-08-252020-08-25СтудИзба

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 16)

We close this chapter by introducing some additional concepts fromthe ﬁeld of information theory, which will also prove useful in our development ofpattern recognition and machine learning techniques. Again, we shall focus only onthe key concepts, and we refer the reader elsewhere for more detailed discussions(Viterbi and Omura, 1979; Cover and Thomas, 1991; MacKay, 2003) .We begin by considering a discrete random variable x and we ask how muchinformation is received when we observe a speciﬁc value for this variable. Theamount of information can be viewed as the ‘degree of surprise’ on learning thevalue of x. If we are told that a highly improbable event has just occurred, we willhave received more information than if we were told that some very likely eventhas just occurred, and if we knew that the event was certain to happen we wouldreceive no information.

Our measure of information content will therefore dependon the probability distribution p(x), and we therefore look for a quantity h(x) thatis a monotonic function of the probability p(x) and that expresses the informationcontent. The form of h(·) can be found by noting that if we have two events xand y that are unrelated, then the information gain from observing both of themshould be the sum of the information gained from each of them separately, so thath(x, y) = h(x) + h(y). Two unrelated events will be statistically independent andso p(x, y) = p(x)p(y). From these two relationships, it is easily shown that h(x)must be given by the logarithm of p(x) and so we have491.6.

Information Theory22q=1|y − t|q|y − t|qq = 0.310−2−10y −t110−222−112122q = 10|y − t|qq=2|y − t|q0y −t10−2−1Figure 1.290y −t1210−2−10y −tPlots of the quantity Lq = |y − t|q for various values of q.h(x) = − log2 p(x)(1.92)where the negative sign ensures that information is positive or zero. Note that lowprobability events x correspond to high information content.

The choice of basisfor the logarithm is arbitrary, and for the moment we shall adopt the conventionprevalent in information theory of using logarithms to the base of 2. In this case, aswe shall see shortly, the units of h(x) are bits (‘binary digits’).Now suppose that a sender wishes to transmit the value of a random variable toa receiver. The average amount of information that they transmit in the process isobtained by taking the expectation of (1.92) with respect to the distribution p(x) andis given byp(x) log2 p(x).(1.93)H[x] = −xThis important quantity is called the entropy of the random variable x. Note thatlimp→0 p ln p = 0 and so we shall take p(x) ln p(x) = 0 whenever we encounter avalue for x such that p(x) = 0.So far we have given a rather heuristic motivation for the deﬁnition of informa-501.

INTRODUCTIONtion (1.92) and the corresponding entropy (1.93). We now show that these deﬁnitionsindeed possess useful properties. Consider a random variable x having 8 possiblestates, each of which is equally likely. In order to communicate the value of x toa receiver, we would need to transmit a message of length 3 bits. Notice that theentropy of this variable is given byH[x] = −8 ×11log2 = 3 bits.88Now consider an example (Cover and Thomas, 1991) of a variable having 8 possible states {a, b, c, d, e, f, g, h} for which the respective probabilities are given by11111, 64, 64, 64, 64). The entropy in this case is given by( 12 , 14 , 18 , 161 11 1111141log2−log2= 2 bits.H[x] = − log2 − log2 − log2 −22 44 88 1616 6464We see that the nonuniform distribution has a smaller entropy than the uniform one,and we shall gain some insight into this shortly when we discuss the interpretation ofentropy in terms of disorder.

For the moment, let us consider how we would transmitthe identity of the variable’s state to a receiver. We could do this, as before, usinga 3-bit number. However, we can take advantage of the nonuniform distribution byusing shorter codes for the more probable events, at the expense of longer codes forthe less probable events, in the hope of getting a shorter average code length. Thiscan be done by representing the states {a, b, c, d, e, f, g, h} using, for instance, thefollowing set of code strings: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.The average length of the code that has to be transmitted is thenaverage code length =11111×1+ ×2+ ×3+×4+4×× 6 = 2 bits2481664which again is the same as the entropy of the random variable.

Note that shorter codestrings cannot be used because it must be possible to disambiguate a concatenationof such strings into its component parts. For instance, 11001110 decodes uniquelyinto the state sequence c, a, d.This relation between entropy and shortest coding length is a general one. Thenoiseless coding theorem (Shannon, 1948) states that the entropy is a lower boundon the number of bits needed to transmit the state of a random variable.From now on, we shall switch to the use of natural logarithms in deﬁning entropy, as this will provide a more convenient link with ideas elsewhere in this book.In this case, the entropy is measured in units of ‘nats’ instead of bits, which differsimply by a factor of ln 2.We have introduced the concept of entropy in terms of the average amount ofinformation needed to specify the state of a random variable.

In fact, the concept ofentropy has much earlier origins in physics where it was introduced in the contextof equilibrium thermodynamics and later given a deeper interpretation as a measureof disorder through developments in statistical mechanics. We can understand thisalternative view of entropy by considering a set of N identical objects that are to bedivided amongst a set of bins, such that there are ni objects in the ith bin. Consider1.6. Information Theory51the number of different ways of allocating the objects to the bins. There are Nways to choose the ﬁrst object, (N − 1) ways to choose the second object, andso on, leading to a total of N ! ways to allocate all N objects to the bins, where N !(pronounced ‘factorial N ’) denotes the product N × (N − 1) × · · · × 2 × 1. However,we don’t wish to distinguish between rearrangements of objects within each bin.

Inthe ith bin there are ni ! ways of reordering the objects, and so the total number ofways of allocating the N objects to the bins is given byN!W =i ni !(1.94)which is called the multiplicity. The entropy is then deﬁned as the logarithm of themultiplicity scaled by an appropriate constantH=11 1ln ni !.ln W =ln N ! −NNN(1.95)iWe now consider the limit N → ∞, in which the fractions ni /N are held ﬁxed, andapply Stirling’s approximationln N ! N ln N − Nwhich givesH = − limN →∞ ni iNlnn iN=−(1.96)pi ln pi(1.97)iwhere we have used i ni = N . Here pi = limN →∞ (ni /N ) is the probabilityof an object being assigned to the ith bin. In physics terminology, the speciﬁc arrangements of objects in the bins is called a microstate, and the overall distributionof occupation numbers, expressed through the ratios ni /N , is called a macrostate.The multiplicity W is also known as the weight of the macrostate.We can interpret the bins as the states xi of a discrete random variable X, wherep(X = xi ) = pi .

The entropy of the random variable X is thenH[p] = −p(xi ) ln p(xi ).(1.98)iAppendix EDistributions p(xi ) that are sharply peaked around a few values will have a relativelylow entropy, whereas those that are spread more evenly across many values willhave higher entropy, as illustrated in Figure 1.30. Because 0 pi 1, the entropyis nonnegative, and it will equal its minimum value of 0 when one of the pi =1 and all other pj=i = 0. The maximum entropy conﬁguration can be found bymaximizing H using a Lagrange multiplier to enforce the normalization constrainton the probabilities.

Thus we maximize =−Hp(xi ) ln p(xi ) + λp(xi ) − 1(1.99)ii521. INTRODUCTION0.50.5H = 3.09probabilitiesprobabilitiesH = 1.770.2500.250Figure 1.30 Histograms of two probability distributions over 30 bins illustrating the higher value of the entropyH for the broader distribution. The largest entropy would arise from a uniform distribution that would give H =− ln(1/30) = 3.40.Exercise 1.29from which we ﬁnd that all of the p(xi ) are equal and are given by p(xi ) = 1/Mwhere M is the total number of states xi . The corresponding value of the entropyis then H = ln M .

This result can also be derived from Jensen’s inequality (to bediscussed shortly). To verify that the stationary point is indeed a maximum, we canevaluate the second derivative of the entropy, which gives∂H1= −Iij∂p(xi )∂p(xj )pi(1.100)where Iij are the elements of the identity matrix.We can extend the deﬁnition of entropy to include distributions p(x) over continuous variables x as follows. First divide x into bins of width ∆. Then, assumingp(x) is continuous, the mean value theorem (Weisstein, 1999) tells us that, for eachsuch bin, there must exist a value xi such that (i+1)∆p(x) dx = p(xi )∆.(1.101)i∆We can now quantize the continuous variable x by assigning any value x to the valuexi whenever x falls in the ith bin. The probability of observing the value xi is thenp(xi )∆. This gives a discrete distribution for which the entropy takes the formp(xi )∆ ln (p(xi )∆) = −p(xi )∆ ln p(xi ) − ln ∆(1.102)H∆ = −iiwhere we have used i p(xi )∆ = 1, which follows from (1.101).

Характеристики

Тип файла

PDF-файл

Размер

9,37 Mb

Материал

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Тип материала

Книга

Предмет

(ММО) Методы машинного обучения

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

bishop-c.m.-pattern-recognition-and-machine-learning-2006.pdf.rar

Bishop C.M. Pattern Recognition and Machine Learning (2006).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.