An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 63

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 63 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 632020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 63)

P(c) is the prior probability of adocument occurring in class c. If a document’s terms do not provide clearevidence for one class versus another, we choose the one that has a higherprior probability. ht1 , t2 , . . . , tnd i are the tokens in d that are part of the vocabulary we use for classification and nd is the number of such tokens in d.

Forexample, ht1 , t2 , . . . , tnd i for the one-sentence document Beijing and Taipei jointhe WTO might be hBeijing, Taipei, join, WTOi, with nd = 4, if we treat the termsand and the as stop words.In text classification, our goal is to find the best class for the document. Thebest class in NB classification is the most likely or maximum a posteriori (MAP)class cmap :cmap = arg max P̂(c|d) = arg max P̂(c)c ∈Cc ∈C∏1≤k ≤n dP̂(tk |c).We write P̂ for P because we do not know the true values of the parametersP(c) and P(tk |c), but estimate them from the training set as we will see in amoment.In Equation (13.3), many conditional probabilities are multiplied, one foreach position 1 ≤ k ≤ nd .

This can result in a floating point underflow.It is therefore better to perform the computation by adding logarithms ofprobabilities instead of multiplying probabilities. The class with the highestlog probability score is still the most probable; log( xy) = log( x ) + log(y)and the logarithm function is monotonic. Hence, the maximization that is1. We will explain in the next section why P ( c|d) is proportional to (∝), not equal to the quantityon the right.Online edition (c) 2009 Cambridge UP25913.2 Naive Bayes text classificationactually done in most implementations of NB is:(13.4)cmap = arg max [log P̂(c) +c ∈C∑1≤k ≤n dlog P̂(tk |c)].Equation (13.4) has a simple interpretation.

Each conditional parameterlog P̂(tk |c) is a weight that indicates how good an indicator tk is for c. Similarly, the prior log P̂(c) is a weight that indicates the relative frequency ofc. More frequent classes are more likely to be the correct class than infrequent classes. The sum of log prior and term weights is then a measure ofhow much evidence there is for the document being in the class, and Equation (13.4) selects the class for which we have the most evidence.We will initially work with this intuitive interpretation of the multinomialNB model and defer a formal derivation to Section 13.4.How do we estimate the parameters P̂(c) and P̂(tk |c)? We first try themaximum likelihood estimate (MLE; Section 11.3.2, page 226), which is simply the relative frequency and corresponds to the most likely value of eachparameter given the training data.

For the priors this estimate is:(13.5)P̂(c) =Nc,Nwhere Nc is the number of documents in class c and N is the total number ofdocuments.We estimate the conditional probability P̂(t|c) as the relative frequency ofterm t in documents belonging to class c:(13.6)P̂(t|c) =Tct,∑t′ ∈V Tct′where Tct is the number of occurrences of t in training documents from classc, including multiple occurrences of a term in a document. We have made thepositional independence assumption here, which we will discuss in more detailin the next section: Tct is a count of occurrences in all positions k in the documents in the training set. Thus, we do not compute different estimates fordifferent positions and, for example, if a word occurs twice in a document,in positions k1 and k2 , then P̂(tk1 |c) = P̂(tk2 |c).The problem with the MLE estimate is that it is zero for a term–class combination that did not occur in the training data.

If the term WTO in the trainingdata only occurred in China documents, then the MLE estimates for the otherclasses, for example UK, will be zero:P̂(WTO|UK) = 0.Now, the one-sentence document Britain is a member of the WTO will get aconditional probability of zero for UK because we are multiplying the conditional probabilities for all terms in Equation (13.2). Clearly, the model shouldOnline edition (c) 2009 Cambridge UP26013 Text classification and Naive BayesT RAIN M ULTINOMIAL NB (C, D )1 V ← E XTRACT V OCABULARY (D )2 N ← C OUNT D OCS (D )3 for each c ∈ C4 do Nc ← C OUNT D OCS I N C LASS (D, c)5prior [c] ← Nc /N6textc ← C ONCATENATE T EXT O FA LL D OCS I N C LASS(D, c)7for each t ∈ V8do Tct ← C OUNT T OKENS O F T ERM (textc , t)9for each t ∈ V10do condprob[t][c] ← ∑ ′T(ctT+′1+1)ctt11 return V, prior, condprobA PPLY M ULTINOMIAL NB (C, V, prior, condprob, d)1 W ← E XTRACT T OKENS F ROM D OC (V, d)2 for each c ∈ C3 do score[c] ← log prior [c]4for each t ∈ W5do score[c] += log condprob[t][c]6 return arg maxc∈C score[c]◮ Figure 13.2 Naive Bayes algorithm (multinomial model): Training and testing.SPARSENESSADD - ONE SMOOTHING(13.7)assign a high probability to the UK class because the term Britain occurs.

Theproblem is that the zero probability for WTO cannot be “conditioned away,”no matter how strong the evidence for the class UK from other features. Theestimate is 0 because of sparseness: The training data are never large enoughto represent the frequency of rare events adequately, for example, the frequency of WTO occurring in UK documents.To eliminate zeros, we use add-one or Laplace smoothing, which simply addsone to each count (cf. Section 11.3.2):P̂(t|c) =Tct + 1Tct + 1,=(∑t′ ∈V Tct′ ) + B∑t′ ∈V ( Tct′ + 1)where B = |V | is the number of terms in the vocabulary.

Add-one smoothingcan be interpreted as a uniform prior (each term occurs once for each class)that is then updated as evidence from the training data comes in. Note thatthis is a prior probability for the occurrence of a term as opposed to the priorprobability of a class which we estimate in Equation (13.5) on the documentlevel.Online edition (c) 2009 Cambridge UP26113.2 Naive Bayes text classification◮ Table 13.1 Data for parameter estimation examples.training settest setdocID12345words in documentChinese Beijing ChineseChinese Chinese ShanghaiChinese MacaoTokyo Japan ChineseChinese Chinese Chinese Tokyo Japanin c = China?yesyesyesno?◮ Table 13.2 Training and test times for NB.modetrainingtestingtime complexityΘ(|D | Lave + |C ||V |)Θ( La + |C | Ma ) = Θ(|C | Ma )We have now introduced all the elements we need for training and applying an NB classifier.

The complete algorithm is described in Figure 13.2.✎Example 13.1: For the example in Table 13.1, the multinomial parameters weneed to classify the test document are the priors P̂ (c) = 3/4 and P̂ (c) = 1/4 and thefollowing conditional probabilities:=(5 + 1)/(8 + 6) = 6/14 = 3/7P̂ (Tokyo| c) = P̂ (Japan| c)=(0 + 1)/(8 + 6) = 1/14=(1 + 1)/(3 + 6) = 2/9P̂ (Tokyo| c) = P̂ (Japan| c)=(1 + 1)/(3 + 6) = 2/9P̂ (Chinese| c)P̂ (Chinese| c)The denominators are (8 + 6) and (3 + 6) because the lengths of textc and textc are 8and 3, respectively, and because the constant B in Equation (13.7) is 6 as the vocabulary consists of six terms.We then get:P̂ (c| d5 )P̂ (c | d5 )∝∝3/4 · (3/7)3 · 1/14 · 1/14 ≈ 0.0003.1/4 · (2/9)3 · 2/9 · 2/9 ≈ 0.0001.Thus, the classifier assigns the test document to c = China.

The reason for this classification decision is that the three occurrences of the positive indicator Chinese in d5outweigh the occurrences of the two negative indicators Japan and Tokyo.What is the time complexity of NB? The complexity of computing the parameters is Θ(|C ||V |) because the set of parameters consists of |C ||V | conditional probabilities and |C | priors.

The preprocessing necessary for computing the parameters (extracting the vocabulary, counting terms, etc.) canbe done in one pass through the training data. The time complexity of thisOnline edition (c) 2009 Cambridge UP26213 Text classification and Naive Bayescomponent is therefore Θ(|D | Lave ), where |D | is the number of documentsand Lave is the average length of a document.We use Θ(|D | Lave ) as a notation for Θ( T ) here, where T is the length of thetraining collection. This is nonstandard; Θ(.) is not defined for an average.We prefer expressing the time complexity in terms of D and Lave becausethese are the primary statistics used to characterize training collections.The time complexity of A PPLY M ULTINOMIAL NB in Figure 13.2 is Θ(|C | La ).La and Ma are the numbers of tokens and types, respectively, in the test document. A PPLY M ULTINOMIAL NB can be modified to be Θ( La + |C | Ma ) (Exercise 13.8).

Finally, assuming that the length of test documents is bounded,Θ( La + |C | Ma ) = Θ(|C | Ma ) because La < b|C | Ma for a fixed constant b.2Table 13.2 summarizes the time complexities. In general, we have |C ||V | <|D | Lave , so both training and testing complexity are linear in the time it takesto scan the data. Because we have to look at the data at least once, NB can besaid to have optimal time complexity. Its efficiency is one reason why NB isa popular text classification method.13.2.1Relation to multinomial unigram language modelThe multinomial NB model is formally identical to the multinomial unigramlanguage model (Section 12.2.1, page 242).

In particular, Equation (13.2) isa special case of Equation (12.12) from page 243, which we repeat here forλ = 1:P ( d | q ) ∝ P ( d ) ∏ P ( t | Md ) .(13.8)t∈qThe document d in text classification (Equation (13.2)) takes the role of thequery in language modeling (Equation (13.8)) and the classes c in text classification take the role of the documents d in language modeling. We usedEquation (13.8) to rank documents according to the probability that they arerelevant to the query q. In NB classification, we are usually only interestedin the top-ranked class.We also used MLE estimates in Section 12.2.2 (page 243) and encounteredthe problem of zero estimates owing to sparse data (page 244); but insteadof add-one smoothing, we used a mixture of two distributions to address theproblem there. Add-one smoothing is closely related to add- 21 smoothing inSection 11.3.4 (page 228).?Exercise 13.1Why is |C ||V | < |D | Lave in Table 13.2 expected to hold for most text collections?2.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.