An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 64

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 64 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 642020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 64)

Our assumption here is that the length of test documents is bounded. La would exceedb |C| Ma for extremely long test documents.Online edition (c) 2009 Cambridge UP13.3 The Bernoulli model263T RAIN B ERNOULLI NB (C, D )1 V ← E XTRACT V OCABULARY (D )2 N ← C OUNT D OCS (D )3 for each c ∈ C4 do Nc ← C OUNT D OCS I N C LASS (D, c)5prior [c] ← Nc /N6for each t ∈ V7do Nct ← C OUNT D OCS I N C LASS C ONTAINING T ERM (D, c, t)8condprob[t][c] ← ( Nct + 1)/( Nc + 2)9 return V, prior, condprobA PPLY B ERNOULLI NB (C, V, prior, condprob, d)1 Vd ← E XTRACT T ERMS F ROM D OC (V, d)2 for each c ∈ C3 do score[c] ← log prior [c]4for each t ∈ V5do if t ∈ Vd6then score[c] += log condprob[t][c]7else score[c] += log(1 − condprob[t][c])8 return arg maxc∈C score[c]◮ Figure 13.3 NB algorithm (Bernoulli model): Training and testing.

The add-onesmoothing in Line 8 (top) is in analogy to Equation (13.7) with B = 2.13.3B ERNOULLI MODELThe Bernoulli modelThere are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model. It generates oneterm from the vocabulary in each position of the document, where we assume a generative model that will be discussed in more detail in Section 13.4(see also page 237).An alternative to the multinomial model is the multivariate Bernoulli modelor Bernoulli model. It is equivalent to the binary independence model of Section 11.3 (page 222), which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence.

Figure 13.3 presents training and testing algorithms for theBernoulli model. The Bernoulli model has the same time complexity as themultinomial model.The different generation models imply different estimation strategies anddifferent classification rules.

The Bernoulli model estimates P̂(t|c) as the fraction of documents of class c that contain term t (Figure 13.3, T RAIN B ERNOULLI -Online edition (c) 2009 Cambridge UP26413 Text classification and Naive BayesNB, line 8). In contrast, the multinomial model estimates P̂(t|c) as the fraction of tokens or fraction of positions in documents of class c that contain termt (Equation (13.7)). When classifying a test document, the Bernoulli modeluses binary occurrence information, ignoring the number of occurrences,whereas the multinomial model keeps track of multiple occurrences.

As aresult, the Bernoulli model typically makes many mistakes when classifyinglong documents. For example, it may assign an entire book to the class Chinabecause of a single occurrence of the term China.The models also differ in how nonoccurring terms are used in classification. They do not affect the classification decision in the multinomial model;but in the Bernoulli model the probability of nonoccurrence is factored inwhen computing P(c|d) (Figure 13.3, A PPLY B ERNOULLI NB, Line 7). This isbecause only the Bernoulli NB model models absence of terms explicitly.✎Applying the Bernoulli model to the example in Table 13.1, wehave the same estimates for the priors as before: P̂ (c) = 3/4, P̂ (c) = 1/4.

Theconditional probabilities are:Example 13.2:P̂ (Chinese| c)=(3 + 1)/(3 + 2) = 4/5=(0 + 1)/(3 + 2) = 1/5P̂ (Beijing| c) = P̂ (Macao| c) = P̂ (Shanghai| c)P̂ (Japan| c) = P̂ (Tokyo| c)P̂ (Japan| c) = P̂ (Tokyo| c)=(1 + 1)/(3 + 2) = 2/5P̂ (Chinese| c)=(1 + 1)/(1 + 2) = 2/3=(1 + 1)/(1 + 2) = 2/3P̂ (Beijing| c) = P̂ (Macao| c) = P̂ (Shanghai| c)=(0 + 1)/(1 + 2) = 1/3The denominators are (3 + 2) and (1 + 2) because there are three documents in cand one document in c and because the constant B in Equation (13.7) is 2 – there aretwo cases to consider for each term, occurrence and nonoccurrence.The scores of the test document for the two classes areP̂ (c| d5 )∝=≈P̂ (c) · P̂ (Chinese| c) · P̂ (Japan| c) · P̂ (Tokyo| c)· (1 − P̂ (Beijing| c)) · (1 − P̂ (Shanghai| c)) · (1 − P̂ (Macao| c))3/4 · 4/5 · 1/5 · 1/5 · (1 − 2/5) · (1 − 2/5) · (1 − 2/5)0.005and, analogously,P̂ (c | d5 )∝≈1/4 · 2/3 · 2/3 · 2/3 · (1 − 1/3) · (1 − 1/3) · (1 − 1/3)0.022Thus, the classifier assigns the test document to c = not-China.

When looking onlyat binary occurrence and not at term frequency, Japan and Tokyo are indicators for c(2/3 > 1/5) and the conditional probabilities of Chinese for c and c are not differentenough (4/5 vs. 2/3) to affect the classification decision.Online edition (c) 2009 Cambridge UP26513.4 Properties of Naive Bayes13.4Properties of Naive BayesTo gain a better understanding of the two models and the assumptions theymake, let us go back and examine how we derived their classification rules inChapters 11 and 12. We decide class membership of a document by assigningit to the class with the maximum a posteriori probability (cf. Section 11.3.2,page 226), which we compute as follows:cmap= arg max P(c|d)c ∈C(13.9)= arg maxc ∈C(13.10)P(d|c) P(c)P(d)= arg max P(d|c) P(c),c ∈Cwhere Bayes’ rule (Equation (11.4), page 220) is applied in (13.9) and we dropthe denominator in the last step because P(d) is the same for all classes anddoes not affect the argmax.We can interpret Equation (13.10) as a description of the generative processwe assume in Bayesian text classification.

To generate a document, we firstchoose class c with probability P(c) (top nodes in Figures 13.4 and 13.5). Thetwo models differ in the formalization of the second step, the generation ofthe document given the class, corresponding to the conditional distributionP ( d | c ):(13.11)Multinomial(13.12)BernoulliP(d|c)P(d|c)==P(ht1, . . . , tk , . . . , tnd i|c)P(he1, .

. . , ei , . . . , e M i|c),where ht1 , . . . , tnd i is the sequence of terms as it occurs in d (minus termsthat were excluded from the vocabulary) and he1 , . . . , ei , . . . , e M i is a binaryvector of dimensionality M that indicates for each term whether it occurs ind or not.It should now be clearer why we introduced the document space X inEquation (13.1) when we defined the classification problem. A critical stepin solving a text classification problem is to choose the document representation. ht1 , . .

. , tnd i and he1 , . . . , e M i are two different document representations. In the first case, X is the set of all term sequences (or, more precisely,sequences of term tokens). In the second case, X is {0, 1} M .We cannot use Equations (13.11) and (13.12) for text classification directly.For the Bernoulli model, we would have to estimate 2 M |C | different parameters, one for each possible combination of M values ei and a class.

Thenumber of parameters in the multinomial case has the same order of magni-Online edition (c) 2009 Cambridge UP26613 Text classification and Naive BayesC=ChinaX1 =BeijingX2 =andX3 =TaipeiX4 =joinX5 =WTO◮ Figure 13.4 The multinomial NB model.CONDITIONALINDEPENDENCEASSUMPTION(13.13)(13.14)RANDOM VARIABLEXRANDOM VARIABLE Utude.3 This being a very large quantity, estimating these parameters reliablyis infeasible.To reduce the number of parameters, we make the Naive Bayes conditionalindependence assumption. We assume that attribute values are independent ofeach other given the class:MultinomialBernoulliP(d|c)=P(d|c)=P(ht1, .

. . , tnd i|c) =P(he1 , . . . , e M i|c) =∏1≤k ≤n d∏1≤i ≤ MP ( Xk = t k | c )P ( Ui = e i | c ) .We have introduced two random variables here to make the two differentgenerative models explicit. Xk is the random variable for position k in thedocument and takes as values terms from the vocabulary. P( Xk = t|c) is theprobability that in a document of class c the term t will occur in position k.

Uiis the random variable for vocabulary term i and takes as values 0 (absence)and 1 (presence). P̂(Ui = 1|c) is the probability that in a document of class cthe term ti will occur – in any position and possibly multiple times.We illustrate the conditional independence assumption in Figures 13.4 and 13.5.The class China generates values for each of the five term attributes (multinomial) or six binary attributes (Bernoulli) with a certain probability, independent of the values of the other attributes. The fact that a document in theclass China contains the term Taipei does not make it more likely or less likelythat it also contains Beijing.In reality, the conditional independence assumption does not hold for textdata. Terms are conditionally dependent on each other.

But as we will discuss shortly, NB models perform well despite the conditional independenceassumption.3. In fact, if the length of documents is not bounded, the number of parameters in the multinomial case is infinite.Online edition (c) 2009 Cambridge UP26713.4 Properties of Naive BayesC=ChinaU Alaska =0U Beijing =1U India =0U join =1U Taipei =1U WTO =1◮ Figure 13.5 The Bernoulli NB model.POSITIONALINDEPENDENCEEven when assuming conditional independence, we still have too manyparameters for the multinomial model if we assume a different probabilitydistribution for each position k in the document. The position of a term in adocument by itself does not carry information about the class.

Althoughthere is a difference between China sues France and France sues China, theoccurrence of China in position 1 versus position 3 of the document is notuseful in NB classification because we look at each term separately. The conditional independence assumption commits us to this way of processing theevidence.Also, if we assumed different term distributions for each position k, wewould have to estimate a different set of parameters for each k. The probability of bean appearing as the first term of a coffee document could be differentfrom it appearing as the second term, and so on.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.