
An Introduction to Information Retrieval. Manning, Raghavan (2009). Online edition (c) 2009 Cambridge UP.

Much of this work can be used to suggest zones that may be distinctively useful for text classification. For example, Kołcz et al. (2000) consider a form of feature selection where you classify documents based only on words in certain zones. Based on text summarization research, they consider using (i) only the title, (ii) only the first paragraph, (iii) only the paragraph with the most title words or keywords, (iv) the first two paragraphs or the first and last paragraph, or (v) all sentences with a minimum number of title words or keywords.
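A minimal sketch of such zone-restricted feature extraction (ours, not from Kołcz et al.; the dict-based document representation with "title" and "body" fields is an assumption for illustration):

```python
def zone_features(doc, zone="title"):
    """Return the bag of words from a single zone of a document.

    `doc` is assumed to be a dict with a "title" string and a "body"
    list of paragraph strings (an illustrative representation).
    """
    if zone == "title":
        text = doc["title"]
    elif zone == "first_paragraph":
        text = doc["body"][0]
    else:
        raise ValueError("unknown zone: " + zone)
    return set(text.lower().split())

doc = {"title": "Linux kernel overview",
       "body": ["The Linux kernel schedules processes.",
                "It also manages memory."]}
print(sorted(zone_features(doc)))   # -> ['kernel', 'linux', 'overview']
print(sorted(zone_features(doc, "first_paragraph")))
```

A classifier would then be trained on these zone-restricted bags of words instead of the full document text.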

In general, these positional feature selection methods produced as good results as mutual information (Section 13.5.1), and resulted in quite competitive classifiers. Ko et al. (2004) also took inspiration from text summarization research to upweight sentences with either words from the title or words that are central to the document's content, leading to classification accuracy gains of almost 1%. This presumably works because most such sentences are somehow more central to the concerns of the document.

Exercise 15.4 [⋆⋆]

Spam email often makes use of various cloaking techniques to try to get through.

One method is to pad or substitute characters so as to defeat word-based text classifiers. For example, you see terms like the following in spam email:

Rep1icaRolex
PHARlbdMACY
bonmus
[LEV]i[IT]l[RA]
Viiiaaaagra
se∧xual
pi11z
ClAfLlS

Discuss how you could engineer features that would largely defeat this strategy.

Exercise 15.5 [⋆⋆]

Another strategy often used by purveyors of email spam is to follow the message they wish to send (such as buying a cheap stock or whatever) with a paragraph of text from another innocuous source (such as a news article).

Why might this strategy be effective? How might it be addressed by a text classifier?

Exercise 15.6 [⋆]

What other kinds of features appear as if they would be useful in an email spam classifier?

15.4 Machine learning methods in ad hoc information retrieval

Rather than coming up with term and document weighting functions by hand, as we primarily did in Chapter 6, we can view different sources of relevance signal (cosine score, title match, etc.) as features in a learning problem. A classifier that has been fed examples of relevant and nonrelevant documents for each of a set of queries can then figure out the relative weights of these signals. If we configure the problem so that there are pairs of a document and a query which are assigned a relevance judgment of relevant or nonrelevant, then we can think of this problem too as a text classification problem.

Taking such a classification approach is not necessarily best, and we present an alternative in Section 15.4.2. Nevertheless, given the material we have covered, the simplest place to start is to approach this problem as a classification problem, by ordering the documents according to the confidence of a two-class classifier in its relevance decision. And this move is not purely pedagogical; exactly this approach is sometimes used in practice.

15.4.1 A simple example of machine-learned scoring

In this section we generalize the methodology of Section 6.1.2 (page 113) to machine learning of the scoring function.

In Section 6.1.2 we considered a case where we had to combine Boolean indicators of relevance; here we consider more general factors to further develop the notion of machine-learned relevance. In particular, the factors we now consider go beyond Boolean functions of query term presence in document zones, as in Section 6.1.2.

Table 15.3  Training examples for machine-learned scoring.

Example  DocID  Query                   Cosine score  ω  Judgment
Φ1       37     linux operating system  0.032         3  relevant
Φ2       37     penguin logo            0.02          4  nonrelevant
Φ3       238    operating system        0.043         2  relevant
Φ4       238    runtime environment     0.004         2  nonrelevant
Φ5       1741   kernel layer            0.022         3  relevant
Φ6       2094   device driver           0.03          2  relevant
Φ7       3191   device driver           0.027         5  nonrelevant

We develop the ideas in a setting where the scoring function is a linear combination of two factors: (1) the vector space cosine similarity between query and document and (2) the minimum window width ω within which the query terms lie.
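The second factor, the minimum window width ω, can be computed directly from token positions. A simple standard-library sketch (ours; the book does not give an implementation of this proximity measure):

```python
def min_window_width(doc_tokens, query_terms):
    """Width (in tokens, inclusive) of the smallest document window
    containing every query term; None if some term never occurs."""
    wanted = set(query_terms)
    if not all(t in doc_tokens for t in wanted):
        return None
    best = None
    # Brute force over window starts (fine for short documents);
    # a minimal window must begin at an occurrence of a query term.
    for start in range(len(doc_tokens)):
        if doc_tokens[start] not in wanted:
            continue
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in wanted:
                seen.add(doc_tokens[end])
            if seen == wanted:
                width = end - start + 1
                if best is None or width < best:
                    best = width
                break
    return best

doc = "the linux kernel is an operating system kernel".split()
print(min_window_width(doc, ["operating", "system"]))  # -> 2
print(min_window_width(doc, ["linux", "system"]))      # -> 6
```

Two-pointer variants run in linear time, but the quadratic scan above keeps the definition of ω explicit.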

As we noted in Section 7.2.2 (page 144), query term proximity is often very indicative of a document being on topic, especially with longer documents and on the web. Among other things, this quantity gives us an implementation of implicit phrases. Thus we have one factor that depends on the statistics of query terms in the document as a bag of words, and another that depends on proximity weighting. We consider only two features in the development of the ideas because a two-feature exposition remains simple enough to visualize. The technique can be generalized to many more features.

As in Section 6.1.2, we are provided with a set of training examples, each of which is a pair consisting of a query and a document, together with a relevance judgment for that document on that query that is either relevant or nonrelevant. For each such example we can compute the vector space cosine similarity, as well as the window width ω.

The result is a training set as shown in Table 15.3, which resembles Figure 6.5 (page 115) from Section 6.1.2. Here, the two features (cosine score denoted α and window width ω) are real-valued predictors. If we once again quantify the judgment relevant as 1 and nonrelevant as 0, we seek a scoring function that combines the values of the features to generate a value that is (close to) 0 or 1. We wish this function to be in agreement with our set of training examples as far as possible. Without loss of generality, a linear classifier will use a linear combination of features of the form

(15.17)    Score(d, q) = Score(α, ω) = aα + bω + c,

with the coefficients a, b, c to be learned from the training data.
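The coefficients of such a linear function can be fitted to the Table 15.3 examples by least squares. A standard-library sketch (our illustration; the book does not prescribe a particular fitting method at this point) that solves the two-feature normal equations in closed form:

```python
# Training data from Table 15.3: (cosine score alpha, window width omega, judgment y)
examples = [
    (0.032, 3, 1), (0.020, 4, 0), (0.043, 2, 1), (0.004, 2, 0),
    (0.022, 3, 1), (0.030, 2, 1), (0.027, 5, 0),
]

n = len(examples)
mean_a = sum(x[0] for x in examples) / n
mean_w = sum(x[1] for x in examples) / n
mean_y = sum(x[2] for x in examples) / n

# Centered second moments: entries of the 2x2 normal equations
Saa = sum((x[0] - mean_a) ** 2 for x in examples)
Sww = sum((x[1] - mean_w) ** 2 for x in examples)
Saw = sum((x[0] - mean_a) * (x[1] - mean_w) for x in examples)
Say = sum((x[0] - mean_a) * (x[2] - mean_y) for x in examples)
Swy = sum((x[1] - mean_w) * (x[2] - mean_y) for x in examples)

# Solve [Saa Saw; Saw Sww] [a; b] = [Say; Swy] by Cramer's rule
det = Saa * Sww - Saw * Saw
a = (Say * Sww - Saw * Swy) / det
b = (Saa * Swy - Saw * Say) / det
c = mean_y - a * mean_a - b * mean_w

def score(alpha, omega):
    return a * alpha + b * omega + c

print(round(a, 2), round(b, 2), round(c, 2))  # -> 28.05 -0.24 0.58
```

On this tiny training set the fitted plane scores every relevant example above every nonrelevant one, so some threshold θ separates them perfectly; a positive a and negative b match the intuition that a higher cosine score and a tighter window both indicate relevance.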

While it is possible to formulate this as an error minimization problem as we did in Section 6.1.2, it is instructive to visualize the geometry of Equation (15.17). The examples in Table 15.3 can be plotted on a two-dimensional plane with axes corresponding to the cosine score α and the window width ω. This is depicted in Figure 15.7.

[Figure 15.7  A collection of training examples. Each R denotes a training example labeled relevant, while each N is a training example labeled nonrelevant. Vertical axis: cosine score; horizontal axis: term proximity ω.]

In this setting, the function Score(α, ω) from Equation (15.17) represents a plane “hanging above” Figure 15.7.

Ideally this plane (in the direction perpendicular to the page containing Figure 15.7) assumes values close to 1 above the points marked R, and values close to 0 above the points marked N. Since a plane is unlikely to assume only values close to 0 or 1 above the training sample points, we make use of thresholding: given any query and document for which we wish to determine relevance, we pick a value θ, and if Score(α, ω) > θ we declare the document to be relevant, else we declare the document to be nonrelevant. As we know from Figure 14.8 (page 301), all points that satisfy Score(α, ω) = θ form a line (shown as a dashed line in Figure 15.7), and we thus have a linear classifier that separates relevant from nonrelevant instances. Geometrically, we can find the separating line as follows.

Consider the line passing through the plane Score(α, ω) whose height is θ above the page containing Figure 15.7. Project this line down onto Figure 15.7; this will be the dashed line in Figure 15.7. Then, any subsequent query/document pair that falls below the dashed line in Figure 15.7 is deemed nonrelevant; above the dashed line, relevant.

Thus, the problem of making a binary relevant/nonrelevant judgment given training examples as above turns into one of learning the dashed line in Figure 15.7 separating relevant training examples from the nonrelevant ones. Being in the α-ω plane, this line can be written as a linear equation involving α and ω, with two parameters (slope and intercept). The methods of linear classification that we have already looked at in Chapters 13–15 provide methods for choosing this line.
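To make the thresholding and the dashed line concrete, here is a small sketch in which the coefficient values and θ are illustrative numbers we chose, not values given in the book:

```python
# Illustrative coefficients and threshold; in practice these come from training
a, b, c = 28.0, -0.24, 0.58
theta = 0.45

def is_relevant(alpha, omega):
    """Classify by thresholding the linear score of Equation (15.17)."""
    return a * alpha + b * omega + c > theta

def boundary_omega(alpha):
    """The dashed line Score(alpha, omega) = theta, solved for omega."""
    return (theta - c - a * alpha) / b

print(is_relevant(0.043, 2))  # high cosine score, tight window -> True
print(is_relevant(0.004, 2))  # very low cosine score -> False
```

Every point on the curve returned by `boundary_omega` scores exactly θ, which is why the classifier's decision boundary is a straight line in the α-ω plane.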

Provided we can build a sufficiently rich collection of training samples, we can thus altogether avoid hand-tuning score functions as in Section 7.2.3 (page 145). The bottleneck of course is the ability to maintain a suitably representative set of training examples, whose relevance assessments must be made by experts.

15.4.2 Result ranking by machine learning

The above ideas can be readily generalized to functions of many more than two variables. There are lots of other scores that are indicative of the relevance of a document to a query, including static quality (PageRank-style measures, discussed in Chapter 21), document age, zone contributions, document length, and so on.

Providing that these measures can be calculated for a training document collection with relevance judgments, any number of such measures can be used to train a machine learning classifier. For instance, we could train an SVM over binary relevance judgments, and order documents based on their probability of relevance, which is monotonic with the documents' signed distance from the decision boundary.

However, approaching IR result ranking like this is not necessarily the right way to think about the problem. Statisticians normally first divide problems into classification problems (where a categorical variable is predicted) versus regression problems (where a real number is predicted). In between is the specialized field of ordinal regression, where a ranking is predicted.
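Ordering documents by signed distance from a linear decision boundary can be sketched as follows; the weight vector and feature vectors below are made up for illustration, and a real system would obtain `w` and `bias` from a trained SVM:

```python
import math

# Hypothetical learned weight vector and bias for a linear classifier
w = [2.0, -0.5]
bias = 0.1

def signed_distance(features):
    """Signed distance of a feature vector from the hyperplane w.x + bias = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, features)) + bias) / norm

# Made-up (doc_id, feature vector) pairs; rank most-relevant first
docs = [("d1", [0.2, 3.0]), ("d2", [0.9, 1.0]), ("d3", [0.5, 2.0])]
ranking = sorted(docs, key=lambda d: signed_distance(d[1]), reverse=True)
print([doc_id for doc_id, _ in ranking])  # -> ['d2', 'd3', 'd1']
```

Because the norm in `signed_distance` is a positive constant, sorting by the raw score w·x + bias gives the same order; the division only matters if the distances themselves are reported.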

Machine learning for ad hoc retrieval is most properly thought of as an ordinal regression problem, where the goal is to rank a set of documents for a query, given training data of the same sort. This formulation gives some additional power, since documents can be evaluated relative to other candidate documents for the same query, rather than having to be mapped to a global scale of goodness, while also weakening the problem space, since just a ranking is required rather than an absolute measure of relevance. Issues of ranking are especially germane in web search, where the ranking at the very top of the results list is exceedingly important, whereas decisions of relevance of a document to a query may be much less important.
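The pairwise reduction behind ranking SVMs can be sketched as follows: for each query, every pair of documents with different relevance grades yields a difference vector on which a linear classifier is trained. The tiny perceptron below is our stand-in for a real SVM solver, and the data is invented:

```python
# Per-query training data: (feature vector, graded relevance label)
query_docs = [
    ([0.9, 1.0], 2),   # highly relevant
    ([0.5, 2.0], 1),   # partially relevant
    ([0.2, 3.0], 0),   # nonrelevant
]

# Pairwise reduction: for every pair with different labels, the
# difference vector (better - worse) should receive a positive score
pairs = []
for xi, yi in query_docs:
    for xj, yj in query_docs:
        if yi > yj:
            pairs.append([u - v for u, v in zip(xi, xj)])

# A tiny perceptron on the difference vectors (stand-in for an SVM)
w = [0.0, 0.0]
for _ in range(100):
    for diff in pairs:
        if sum(wi * di for wi, di in zip(w, diff)) <= 0:
            w = [wi + di for wi, di in zip(w, diff)]

# Ranking new documents is just sorting by the learned linear score
ranked = sorted(query_docs,
                key=lambda d: sum(wi * xi for wi, xi in zip(w, d[0])),
                reverse=True)
print([label for _, label in ranked])  # -> [2, 1, 0]
```

Note that the classifier is trained only on relative comparisons within a query, which is exactly the "weakened" problem described above: no absolute relevance value is ever fitted.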
