An introduction to information retrieval. Manning, Raghavan (2009), page 65


This again causes problems in estimation owing to data sparseness.

For these reasons, we make a second independence assumption for the multinomial model, positional independence: The conditional probabilities for a term are the same independent of position in the document.

$$P(X_{k_1} = t \mid c) = P(X_{k_2} = t \mid c)$$

for all positions $k_1$, $k_2$, terms $t$ and classes $c$. Thus, we have a single distribution of terms that is valid for all positions $k_i$ and we can use $X$ as its symbol.⁴ Positional independence is equivalent to adopting the bag of words model, which we introduced in the context of ad hoc retrieval in Chapter 6 (page 117).

⁴ Our terminology is nonstandard. The random variable $X$ is a categorical variable, not a multinomial variable, and the corresponding NB model should perhaps be called a sequence model. We have chosen to present this sequence model and the multinomial model in Section 13.4.1 as the same model because they are computationally identical.

Online edition (c) 2009 Cambridge UP

13 Text classification and Naive Bayes

Table 13.3 Multinomial versus Bernoulli model.

| | multinomial model | Bernoulli model |
|---|---|---|
| event model | generation of token | generation of document |
| random variable(s) | $X = t$ iff $t$ occurs at given pos | $U_t = 1$ iff $t$ occurs in doc |
| document representation | $d = \langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle$, $t_k \in V$ | $d = \langle e_1, \ldots, e_i, \ldots, e_M \rangle$, $e_i \in \{0, 1\}$ |
| parameter estimation | $\hat{P}(X = t \mid c)$ | $\hat{P}(U_i = e \mid c)$ |
| decision rule: maximize | $\hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(X = t_k \mid c)$ | $\hat{P}(c) \prod_{t_i \in V} \hat{P}(U_i = e_i \mid c)$ |
| multiple occurrences | taken into account | ignored |
| length of docs | can handle longer docs | works best for short docs |
| # features | can handle more | works best with fewer |
| estimate for term the | $\hat{P}(X = \text{the} \mid c) \approx 0.05$ | $\hat{P}(U_{\text{the}} = 1 \mid c) \approx 1.0$ |

With conditional and positional independence assumptions, we only need to estimate $\Theta(M \lvert\mathbb{C}\rvert)$ parameters $P(t_k \mid c)$ (multinomial model) or $P(e_i \mid c)$ (Bernoulli model), one for each term–class combination, rather than a number that is at least exponential in $M$, the size of the vocabulary.
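To make these counts concrete, the following minimal sketch compares the number of term–class parameters with a count that is exponential in $M$. It takes $2^M \lvert\mathbb{C}\rvert$ as one example of such an exponential count; the vocabulary size and number of classes are illustrative assumptions, not figures from the text, and the function names are hypothetical.

```python
import math


def nb_parameter_count(vocab_size: int, num_classes: int) -> int:
    """Parameters under both independence assumptions: one per term-class pair."""
    return vocab_size * num_classes


def log10_exponential_count(vocab_size: int, num_classes: int) -> float:
    """log10 of 2**M * |C|, used here as one example of a count exponential in M.

    Working in log space avoids building an integer with tens of thousands of digits.
    """
    return vocab_size * math.log10(2) + math.log10(num_classes)


# Illustrative sizes (assumptions, not taken from the text).
M = 100_000
NUM_CLASSES = 2

print("term-class parameters:", nb_parameter_count(M, NUM_CLASSES))
print("log10 of exponential count:", round(log10_exponential_count(M, NUM_CLASSES)))
```

The first number grows linearly with the vocabulary, whereas the second is reported only as a base-10 logarithm because the count itself is far too large to be useful.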

The independence assumptions reduce the number of parameters to be estimated by several orders of magnitude.

To summarize, we generate a document in the multinomial model (Figure 13.4) by first picking a class $C = c$ with $P(c)$, where $C$ is a random variable taking values from the set of classes $\mathbb{C}$. Next we generate term $t_k$ in position $k$ with $P(X_k = t_k \mid c)$ for each of the $n_d$ positions of the document. The $X_k$ all have the same distribution over terms for a given $c$. In the example in Figure 13.4, we show the generation of $\langle t_1, t_2, t_3, t_4, t_5 \rangle = \langle \text{Beijing}, \text{and}, \text{Taipei}, \text{join}, \text{WTO} \rangle$, corresponding to the one-sentence document Beijing and Taipei join WTO.

For a completely specified document generation model, we would also have to define a distribution $P(n_d \mid c)$ over lengths. Without it, the multinomial model is a token generation model rather than a document generation model.

We generate a document in the Bernoulli model (Figure 13.5) by first picking a class $C = c$ with $P(c)$ and then generating a binary indicator $e_i$ for each term $t_i$ of the vocabulary ($1 \le i \le M$).
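The two generation processes described so far can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class names, the six-term vocabulary, and all probabilities are invented for the example, and the document length is passed in by the caller because no distribution $P(n_d \mid c)$ over lengths is defined.

```python
import random


def pick_class(class_priors, rng):
    """Draw a class c with probability P(c)."""
    classes = list(class_priors)
    weights = [class_priors[c] for c in classes]
    return rng.choices(classes, weights=weights)[0]


def generate_multinomial(class_priors, term_probs, doc_length, rng):
    """Pick a class, then draw one term per position from the same P(X = t | c)."""
    c = pick_class(class_priors, rng)
    terms = list(term_probs[c])
    weights = [term_probs[c][t] for t in terms]
    return c, rng.choices(terms, weights=weights, k=doc_length)


def generate_bernoulli(class_priors, presence_probs, vocabulary, rng):
    """Pick a class, then draw a binary indicator e_i for every vocabulary term."""
    c = pick_class(class_priors, rng)
    return c, [1 if rng.random() < presence_probs[c][t] else 0 for t in vocabulary]


# Hypothetical parameters, chosen only for illustration.
VOCABULARY = ["Alaska", "Beijing", "India", "join", "Taipei", "WTO"]
CLASS_PRIORS = {"China": 0.6, "not-China": 0.4}
TERM_PROBS = {
    "China": dict(zip(VOCABULARY, [0.05, 0.30, 0.05, 0.20, 0.25, 0.15])),
    "not-China": dict(zip(VOCABULARY, [0.30, 0.05, 0.30, 0.20, 0.05, 0.10])),
}
PRESENCE_PROBS = {
    "China": dict(zip(VOCABULARY, [0.05, 0.80, 0.10, 0.40, 0.70, 0.50])),
    "not-China": dict(zip(VOCABULARY, [0.50, 0.10, 0.60, 0.40, 0.10, 0.20])),
}

rng = random.Random(0)
print(generate_multinomial(CLASS_PRIORS, TERM_PROBS, 5, rng))
print(generate_bernoulli(CLASS_PRIORS, PRESENCE_PROBS, VOCABULARY, rng))
```

The multinomial sampler returns a sequence of tokens whose length is fixed from outside, while the Bernoulli sampler always returns exactly $M$ indicators, one per vocabulary term.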

In the example in Figure 13.5, we show the generation of $\langle e_1, e_2, e_3, e_4, e_5, e_6 \rangle = \langle 0, 1, 0, 1, 1, 1 \rangle$, corresponding, again, to the one-sentence document Beijing and Taipei join WTO where we have assumed that "and" is a stop word.

We compare the two models in Table 13.3, including estimation equations and decision rules.

Naive Bayes is so called because the independence assumptions we have just made are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class.

This is hardly ever true for terms in documents. In many cases, the opposite is true. The pairs hong and kong or london and english in Figure 13.7 are examples of highly dependent terms. In addition, the multinomial model makes an assumption of positional independence. The Bernoulli model ignores positions in documents altogether because it only cares about absence or presence. This bag-of-words model discards all information that is communicated by the order of words in natural language sentences. How can NB be a good text classifier when its model of natural language is so oversimplified?

13.4 Properties of Naive Bayes

Table 13.4 Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation.

| | $c_1$ | $c_2$ | class selected |
|---|---|---|---|
| true probability $P(c \mid d)$ | 0.6 | 0.4 | $c_1$ |
| $\hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c)$ (Equation (13.13)) | 0.00099 | 0.00001 | |
| NB estimate $\hat{P}(c \mid d)$ | 0.99 | 0.01 | $c_1$ |

The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good. Consider a document $d$ with true probabilities $P(c_1 \mid d) = 0.6$ and $P(c_2 \mid d) = 0.4$ as shown in Table 13.4. Assume that $d$ contains many terms that are positive indicators for $c_1$ and many terms that are negative indicators for $c_2$. Thus, when using the multinomial model in Equation (13.13), $\hat{P}(c_1) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c_1)$ will be much larger than $\hat{P}(c_2) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c_2)$ (0.00099 vs. 0.00001 in the table). After division by 0.001 (the sum of the two scores) to get well-formed probabilities for $P(c \mid d)$, we end up with one estimate that is close to 1.0 and one that is close to 0.0. This is common: The winning class in NB classification usually has a much larger probability than the other classes and the estimates diverge very significantly from the true probabilities.
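The normalization step for the numbers in Table 13.4 can be checked with a short sketch. The function name `normalize` is a hypothetical helper introduced for illustration; the scores and true probabilities are the ones given in the table.

```python
def normalize(scores):
    """Divide each unnormalized score by the sum so the results add up to 1."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Values from Table 13.4.
true_probs = {"c1": 0.6, "c2": 0.4}
nb_scores = {"c1": 0.00099, "c2": 0.00001}

nb_estimates = normalize(nb_scores)

# The decision only depends on which class has the highest score.
nb_choice = max(nb_scores, key=nb_scores.get)
true_choice = max(true_probs, key=true_probs.get)

print(nb_estimates)
print("NB picks", nb_choice, "and the true best class is", true_choice)
```

The normalized estimates land near 0.99 and 0.01, far from the true 0.6 and 0.4, yet the class with the largest score is the same in both cases.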

But the classification decision is based on which class gets the highest score. It does not matter how accurate the estimates are. Despite the bad estimates, NB estimates a higher probability for $c_1$ and therefore assigns $d$ to the correct class in Table 13.4. Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation. NB classifiers estimate badly, but often classify well.

Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision. It is also somewhat robust to noise features (as defined in the next section) and concept drift – the gradual change over time of the concept underlying a class like US president from Bill Clinton to George W. Bush (see Section 13.7). Classifiers like kNN (Section 14.3, page 297) can be carefully tuned to idiosyncratic properties of a particular time period. This will then hurt them when documents in the following time period have slightly different properties.

Table 13.5 A set of documents for which the NB independence assumptions are problematic.

(1) He moved from London, Ontario, to London, England.
(2) He moved from London, England, to London, Ontario.
(3) He moved from England to London, Ontario.

The Bernoulli model is particularly robust with respect to concept drift. We will see in Figure 13.8 that it can have decent performance when using fewer than a dozen terms. The most important indicators for a class are less likely to change. Thus, a model that only relies on these features is more likely to maintain a certain level of accuracy in concept drift.

NB's main strength is its efficiency: Training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy it is often used as a baseline in text classification research. It is often the method of choice if (i) squeezing out a few extra percentage points of accuracy is not worth the trouble in a text classification application, (ii) a very large amount of training data is available and there is more to be gained from training on a lot of data than using a better classifier on a smaller training set, or (iii) its robustness to concept drift can be exploited.

In this book, we discuss NB as a classifier for text. The independence assumptions do not hold for text.

However, it can be shown that NB is an optimal classifier (in the sense of minimal error rate on new data) for data where the independence assumptions do hold.

13.4.1 A variant of the multinomial model

An alternative formalization of the multinomial model represents each document $d$ as an $M$-dimensional vector of counts $\langle \mathrm{tf}_{t_1,d}, \ldots, \mathrm{tf}_{t_M,d} \rangle$ where $\mathrm{tf}_{t_i,d}$ is the term frequency of $t_i$ in $d$. $P(d \mid c)$ is then computed as follows (cf. Equation (12.8), page 243):

$$P(d \mid c) = P(\langle \mathrm{tf}_{t_1,d}, \ldots, \mathrm{tf}_{t_M,d} \rangle \mid c) \propto \prod_{1 \le i \le M} P(X = t_i \mid c)^{\mathrm{tf}_{t_i,d}} \tag{13.15}$$

Note that we have omitted the multinomial factor. See Equation (12.8) (page 243). Equation (13.15) is equivalent to the sequence model in Equation (13.2) as $P(X = t_i \mid c)^{\mathrm{tf}_{t_i,d}} = 1$ for terms that do not occur in $d$ ($\mathrm{tf}_{t_i,d} = 0$) and a term that occurs $\mathrm{tf}_{t_i,d} \ge 1$ times will contribute $\mathrm{tf}_{t_i,d}$ factors both in Equation (13.2) and in Equation (13.15).

13.5 Feature selection

SELECTFEATURES(D, c, k)
  V ← EXTRACTVOCABULARY(D)
  L ← []
  for each t ∈ V
  do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
     APPEND(L, ⟨A(t, c), t⟩)
  return FEATURESWITHLARGESTVALUES(L, k)

Figure 13.6 Basic feature selection algorithm for selecting the k best features.

Exercise 13.2 [⋆]
Which of the documents in Table 13.5 have identical and different bag of words representations for (i) the Bernoulli model (ii) the multinomial model? If there are differences, describe them.

Exercise 13.3
The rationale for the positional independence assumption is that there is no useful information in the fact that a term occurs in position k of a document.
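The equivalence between the count-vector form in Equation (13.15) and the sequence model, which rests on positional independence, can be checked numerically. The sketch below is an illustration under stated assumptions: the term probabilities and the token sequence are invented, the class prior is left out because it is the same factor in both forms, and the function names are hypothetical.

```python
import math
from collections import Counter


def sequence_score(tokens, term_probs):
    """Product over positions k of P(X = t_k | c), as in the sequence model."""
    return math.prod(term_probs[t] for t in tokens)


def count_vector_score(tokens, term_probs):
    """Product over vocabulary terms of P(X = t_i | c) ** tf, as in Equation (13.15).

    Terms that do not occur have tf = 0 and therefore contribute a factor of 1.
    """
    tf = Counter(tokens)
    return math.prod(p ** tf[t] for t, p in term_probs.items())


# Hypothetical P(X = t | c) for a single class c (values are assumptions).
term_probs = {"Beijing": 0.30, "Taipei": 0.20, "WTO": 0.10, "join": 0.25, "and": 0.15}

tokens = ["Beijing", "Taipei", "Beijing", "WTO"]
reordered = ["WTO", "Beijing", "Beijing", "Taipei"]

print(sequence_score(tokens, term_probs))
print(count_vector_score(tokens, term_probs))
print(sequence_score(reordered, term_probs))
```

All three values agree up to floating-point rounding: the count vector keeps only term frequencies, so reordering the tokens leaves the score unchanged.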

