An Introduction to Information Retrieval. Manning, Raghavan (2009), page 66
Find exceptions. Consider formulaic documents with a fixed document structure.

Exercise 13.4
Table 13.3 gives Bernoulli and multinomial estimates for the word the. Explain the difference.

13.5 Feature selection

Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachnocentric, has no information about a class, say China, but all instances of arachnocentric happen to occur in China documents in our training set.
Then the learning method might produce a classifier that misassigns test documents containing arachnocentric to China. Such an incorrect generalization from an accidental property of the training set is called overfitting.

We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features). It may appear counterintuitive at first that a seemingly weaker classifier is advantageous in statistical text classification, but when discussing the bias-variance tradeoff in Section 14.6 (page 308), we will see that weaker models are often preferable when limited training data are available.

The basic feature selection algorithm is shown in Figure 13.6.
For a given class c, we compute a utility measure A(t, c) for each term of the vocabulary and select the k terms that have the highest values of A(t, c). All other terms are discarded and not used in classification. We will introduce three different utility measures in this section: mutual information, A(t, c) = I(Ut; Cc); the χ² test, A(t, c) = X²(t, c); and frequency, A(t, c) = N(t, c).

Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.

This section mainly addresses feature selection for two-class classification tasks like China versus not-China.
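Figure 13.6 itself is not reproduced in this excerpt, but the generic procedure it describes can be sketched in a few lines of Python. The names select_features and utility below are illustrative, not from the book; the utility argument stands for any of the three measures just listed.

```python
def select_features(vocabulary, documents, c, k, utility):
    """Return the k terms with the highest utility A(t, c) for class c.

    vocabulary: iterable of candidate terms
    documents:  training documents, passed through to the utility function
    c:          the target class
    k:          number of features to keep
    utility:    function (documents, term, c) -> float, e.g. mutual
                information, the chi-square statistic, or frequency
    """
    # Score every term in the vocabulary with the chosen utility measure.
    scored = [(utility(documents, t, c), t) for t in vocabulary]
    # Keep the k terms with the largest values; all other terms are
    # discarded and not used in classification.
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]
```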
Section 13.5.5 briefly discusses optimizations for systems with more than two classes.

13.5.1 Mutual information

A common feature selection method is to compute A(t, c) as the expected mutual information (MI) of term t and class c.⁵ MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c. Formally:

(13.16)
$$ I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\,P(C = e_c)}, $$

where U is a random variable that takes values et = 1 (the document contains term t) and et = 0 (the document does not contain t), as defined on page 266, and C is a random variable that takes values ec = 1 (the document is in class c) and ec = 0 (the document is not in class c).
We write Ut and Cc if it is not clear from context which term t and class c we are referring to. For MLEs of the probabilities, Equation (13.16) is equivalent to Equation (13.17):

(13.17)
$$ I(U;C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.} N_{.0}}, $$

where the Ns are counts of documents that have the values of et and ec that are indicated by the two subscripts. For example, N10 is the number of documents that contain t (et = 1) and are not in c (ec = 0). N1. = N10 + N11 is the number of documents that contain t (et = 1) and we count documents independent of class membership (ec ∈ {0, 1}).
N = N00 + N01 + N10 + N11 is the total number of documents. An example of one of the MLE estimates that transform Equation (13.16) into Equation (13.17) is P(U = 1, C = 1) = N11/N.

5. Take care not to confuse expected mutual information with pointwise mutual information, which is defined as log N11/E11, where N11 and E11 are defined as in Equation (13.18). The two measures have different properties. See Section 13.7.

✎ Example 13.3: Consider the class poultry and the term export in Reuters-RCV1. The counts of the number of documents with the four possible combinations of indicator values are as follows:

                         ec = epoultry = 1    ec = epoultry = 0
    et = eexport = 1     N11 = 49             N10 = 27,652
    et = eexport = 0     N01 = 141            N00 = 774,106

After plugging these values into Equation (13.17) we get:

$$
\begin{aligned}
I(U;C) ={}& \frac{49}{801{,}948}\log_2\frac{801{,}948 \cdot 49}{(49+27{,}652)(49+141)} + \frac{141}{801{,}948}\log_2\frac{801{,}948 \cdot 141}{(141+774{,}106)(49+141)} \\
& + \frac{27{,}652}{801{,}948}\log_2\frac{801{,}948 \cdot 27{,}652}{(49+27{,}652)(27{,}652+774{,}106)} + \frac{774{,}106}{801{,}948}\log_2\frac{801{,}948 \cdot 774{,}106}{(141+774{,}106)(27{,}652+774{,}106)} \\
\approx{}& 0.0001105
\end{aligned}
$$
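The same computation is easy to check programmatically. Below is a small Python sketch (not from the book) that evaluates Equation (13.17) directly from the four document counts; on the counts of Example 13.3 it returns roughly 0.0001105. The function name is illustrative.

```python
import math

def expected_mutual_information(n11, n10, n01, n00):
    """Equation (13.17): expected MI of a term and a class from document counts.

    n11: documents containing the term and belonging to the class
    n10: documents containing the term, not in the class
    n01: documents not containing the term, in the class
    n00: documents not containing the term, not in the class
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in (
        (n11, n11 + n10, n11 + n01),   # term present, in class
        (n01, n01 + n00, n11 + n01),   # term absent,  in class
        (n10, n11 + n10, n10 + n00),   # term present, not in class
        (n00, n01 + n00, n10 + n00),   # term absent,  not in class
    ):
        if n_tc > 0:  # by convention, 0 * log 0 = 0
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# Counts for class "poultry" and term "export" from Example 13.3.
print(expected_mutual_information(49, 27_652, 141, 774_106))  # ~0.0001105
```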
To select k terms t1, . . . , tk for a given class, we use the feature selection algorithm in Figure 13.6: We compute the utility measure as A(t, c) = I(Ut, Cc) and select the k terms with the largest values.

Mutual information measures how much information, in the information-theoretic sense, a term contains about the class. If a term's distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0. MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.

Figure 13.7 shows terms with high mutual information scores for the six classes in Figure 13.1.⁶ The selected terms (e.g., london, uk, british for the class UK) are of obvious utility for making classification decisions for their respective classes.
At the bottom of the list for UK we find terms like peripherals and tonight (not shown in the figure) that are clearly not helpful in deciding whether the document is in the class.

    UK                      China                     poultry
    london      0.1925      china        0.0997       poultry       0.0013
    uk          0.0755      chinese      0.0523       meat          0.0008
    british     0.0596      beijing      0.0444       chicken       0.0006
    stg         0.0555      yuan         0.0344       agriculture   0.0005
    britain     0.0469      shanghai     0.0292       avian         0.0004
    plc         0.0357      hong         0.0198       broiler       0.0003
    england     0.0238      kong         0.0195       veterinary    0.0003
    pence       0.0212      xinhua       0.0155       birds         0.0003
    pounds      0.0149      province     0.0117       inspection    0.0003
    english     0.0126      taiwan       0.0108       pathogenic    0.0003

    coffee                  elections                 sports
    coffee      0.0111      election     0.0519       soccer        0.0681
    bags        0.0042      elections    0.0342       cup           0.0515
    growers     0.0025      polls        0.0339       match         0.0441
    kg          0.0019      voters       0.0315       matches       0.0408
    colombia    0.0018      party        0.0303       played        0.0388
    brazil      0.0016      vote         0.0299       league        0.0386
    export      0.0014      poll         0.0225       beat          0.0301
    exporters   0.0013      candidate    0.0202       game          0.0299
    exports     0.0013      campaign     0.0202       games         0.0284
    crop        0.0012      democratic   0.0198       team          0.0264

◮ Figure 13.7  Features with high mutual information scores for six Reuters-RCV1 classes.

6. Feature scores were computed on the first 100,000 documents, except for poultry, a rare class, for which 800,000 documents were used. We have omitted numbers and other special words from the top ten lists.
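A per-class ranking like the one in Figure 13.7 can be produced by counting, for each term, the four document counts with respect to a class and scoring the term with Equation (13.17). The Python sketch below shows one possible way to do this; it reuses the expected_mutual_information function from the earlier snippet, and all other names are illustrative.

```python
from collections import Counter

def rank_terms_by_mi(docs, labels, target_class, k=10):
    """Rank vocabulary terms by expected MI with target_class.

    docs:   list of documents, each an iterable of terms
    labels: list of class labels, aligned with docs
    Returns the k (score, term) pairs with the highest MI.
    Requires expected_mutual_information from the earlier sketch.
    """
    n = len(docs)
    n_class = sum(1 for y in labels if y == target_class)
    df_term = Counter()          # documents containing the term
    df_term_class = Counter()    # documents containing the term AND in the class
    for terms, y in zip(docs, labels):
        for t in set(terms):
            df_term[t] += 1
            if y == target_class:
                df_term_class[t] += 1
    scored = []
    for t, n1_dot in df_term.items():
        n11 = df_term_class[t]
        n10 = n1_dot - n11
        n01 = n_class - n11
        n00 = n - n11 - n10 - n01
        scored.append((expected_mutual_information(n11, n10, n01, n00), t))
    return sorted(scored, reverse=True)[:k]
```

The same four contingency counts also suffice for the χ² statistic, so a different utility measure only requires swapping the scoring call.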
As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.

Such an accuracy increase can be observed in Figure 13.8, which shows F1 as a function of vocabulary size after feature selection for Reuters-RCV1.⁷ Comparing F1 at 132,776 features (corresponding to selection of all features) and at 10–100 features, we see that MI feature selection increases F1 by about 0.1 for the multinomial model and by more than 0.2 for the Bernoulli model. For the Bernoulli model, F1 peaks early, at ten features selected. At that point, the Bernoulli model is better than the multinomial model. When basing a classification decision on only a few features, it is more robust to consider binary occurrence only. For the multinomial model (MI feature selection), the peak occurs later, at 100 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model.

7. We trained the classifiers on the first 100,000 documents and computed F1 on the next 100,000. The graphs are averages over five classes.

[Figure 13.8, plot not reproducible from the extracted text: F1 measure (y-axis) against the number of features selected (x-axis, 1 to 10,000, log scale) for four runs: multinomial, MI; multinomial, chisquare; multinomial, frequency; binomial, MI.]
◮ Figure 13.8  Effect of feature set size on accuracy for multinomial and Bernoulli models.