An introduction to information retrieval. Manning_ Raghavan (2009) (811397)

File No. 811397, An introduction to information retrieval. Manning_ Raghavan (2009).pdf, 2020-08-25, StudIzba



Text from the file

An Introduction to Information Retrieval

Draft of April 1, 2009

Christopher D. Manning
Prabhakar Raghavan
Hinrich Schütze

Cambridge University Press
Cambridge, England

Online edition (c) 2009 Cambridge UP

DRAFT! DO NOT DISTRIBUTE WITHOUT PRIOR PERMISSION

© 2009 Cambridge University Press
By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze
Printed on April 1, 2009
Website: http://www.informationretrieval.org/
Comments, corrections, and other feedback most welcome at: informationretrieval@yahoogroups.com

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

Brief Contents

1 Boolean retrieval
2 The term vocabulary and postings lists
3 Dictionaries and tolerant retrieval
4 Index construction
5 Index compression
6 Scoring, term weighting and the vector space model
7 Computing scores in a complete search system
8 Evaluation in information retrieval
9 Relevance feedback and query expansion
10 XML retrieval
11 Probabilistic information retrieval
12 Language models for information retrieval
13 Text classification and Naive Bayes
14 Vector space classification
15 Support vector machines and machine learning on documents
16 Flat clustering
17 Hierarchical clustering
18 Matrix decompositions and latent semantic indexing
19 Web search basics
20 Web crawling and indexes
21 Link analysis

Contents

List of Tables
List of Figures
Table of Notation
Preface

1 Boolean retrieval
1.1 An example information retrieval problem
1.2 A first take at building an inverted index
1.3 Processing Boolean queries
1.4 The extended Boolean model versus ranked retrieval
1.5 References and further reading

2 The term vocabulary and postings lists
2.1 Document delineation and character sequence decoding
2.1.1 Obtaining the character sequence in a document
2.1.2 Choosing a document unit
2.2 Determining the vocabulary of terms
2.2.1 Tokenization
2.2.2 Dropping common terms: stop words
2.2.3 Normalization (equivalence classing of terms)
2.2.4 Stemming and lemmatization
2.3 Faster postings list intersection via skip pointers
2.4 Positional postings and phrase queries
2.4.1 Biword indexes
2.4.2 Positional indexes
2.4.3 Combination schemes
2.5 References and further reading

3 Dictionaries and tolerant retrieval
3.1 Search structures for dictionaries
3.2 Wildcard queries
3.2.1 General wildcard queries
3.2.2 k-gram indexes for wildcard queries
3.3 Spelling correction
3.3.1 Implementing spelling correction
3.3.2 Forms of spelling correction
3.3.3 Edit distance
3.3.4 k-gram indexes for spelling correction
3.3.5 Context sensitive spelling correction
3.4 Phonetic correction
3.5 References and further reading

4 Index construction
4.1 Hardware basics
4.2 Blocked sort-based indexing
4.3 Single-pass in-memory indexing
4.4 Distributed indexing
4.5 Dynamic indexing
4.6 Other types of indexes
4.7 References and further reading

5 Index compression
5.1 Statistical properties of terms in information retrieval
5.1.1 Heaps' law: Estimating the number of terms
5.1.2 Zipf's law: Modeling the distribution of terms
5.2 Dictionary compression
5.2.1 Dictionary as a string
5.2.2 Blocked storage
5.3 Postings file compression
5.3.1 Variable byte codes
5.3.2 γ codes
5.4 References and further reading

6 Scoring, term weighting and the vector space model
6.1 Parametric and zone indexes
6.1.1 Weighted zone scoring
6.1.2 Learning weights
6.1.3 The optimal weight g
6.2 Term frequency and weighting
6.2.1 Inverse document frequency
6.2.2 Tf-idf weighting
6.3 The vector space model for scoring
6.3.1 Dot products
6.3.2 Queries as vectors
6.3.3 Computing vector scores
6.4 Variant tf-idf functions
6.4.1 Sublinear tf scaling
6.4.2 Maximum tf normalization
6.4.3 Document and query weighting schemes
6.4.4 Pivoted normalized document length
6.5 References and further reading

7 Computing scores in a complete search system
7.1 Efficient scoring and ranking
7.1.1 Inexact top K document retrieval
7.1.2 Index elimination
7.1.3 Champion lists
7.1.4 Static quality scores and ordering
7.1.5 Impact ordering
7.1.6 Cluster pruning
7.2 Components of an information retrieval system
7.2.1 Tiered indexes
7.2.2 Query-term proximity
7.2.3 Designing parsing and scoring functions
7.2.4 Putting it all together
7.3 Vector space scoring and query operator interaction
7.4 References and further reading

8 Evaluation in information retrieval
8.1 Information retrieval system evaluation
8.2 Standard test collections
8.3 Evaluation of unranked retrieval sets
8.4 Evaluation of ranked retrieval results
8.5 Assessing relevance
8.5.1 Critiques and justifications of the concept of relevance
8.6 A broader perspective: System quality and user utility
8.6.1 System issues
8.6.2 User utility
8.6.3 Refining a deployed system
8.7 Results snippets
8.8 References and further reading

9 Relevance feedback and query expansion
9.1 Relevance feedback and pseudo relevance feedback
9.1.1 The Rocchio algorithm for relevance feedback
9.1.2 Probabilistic relevance feedback
9.1.3 When does relevance feedback work?
9.1.4 Relevance feedback on the web
9.1.5 Evaluation of relevance feedback strategies
9.1.6 Pseudo relevance feedback
9.1.7 Indirect relevance feedback
9.1.8 Summary
9.2 Global methods for query reformulation
9.2.1 Vocabulary tools for query reformulation
9.2.2 Query expansion
9.2.3 Automatic thesaurus generation
9.3 References and further reading

10 XML retrieval
10.1 Basic XML concepts
10.2 Challenges in XML retrieval
10.3 A vector space model for XML retrieval
10.4 Evaluation of XML retrieval
10.5 Text-centric vs. data-centric XML retrieval
10.6 References and further reading
10.7 Exercises

11 Probabilistic information retrieval
11.1 Review of basic probability theory
11.2 The Probability Ranking Principle
11.2.1 The 1/0 loss case
11.2.2 The PRP with retrieval costs
11.3 The Binary Independence Model
11.3.1 Deriving a ranking function for query terms
11.3.2 Probability estimates in theory
11.3.3 Probability estimates in practice
11.3.4 Probabilistic approaches to relevance feedback
11.4 An appraisal and some extensions
11.4.1 An appraisal of probabilistic models
11.4.2 Tree-structured dependencies between terms
11.4.3 Okapi BM25: a non-binary model
11.4.4 Bayesian network approaches to IR
11.5 References and further reading

12 Language models for information retrieval
12.1 Language models
12.1.1 Finite automata and language models
12.1.2 Types of language models
12.1.3 Multinomial distributions over words
12.2 The query likelihood model
12.2.1 Using query likelihood language models in IR
12.2.2 Estimating the query generation probability
12.2.3 Ponte and Croft's Experiments
12.3 Language modeling versus other approaches in IR
12.4 Extended language modeling approaches
12.5 References and further reading

13 Text classification and Naive Bayes
13.1 The text classification problem
13.2 Naive Bayes text classification
13.2.1 Relation to multinomial unigram language model
13.3 The Bernoulli model
13.4 Properties of Naive Bayes
13.4.1 A variant of the multinomial model
13.5 Feature selection
13.5.1 Mutual information
13.5.2 χ² Feature selection
13.5.3 Frequency-based feature selection
13.5.4 Feature selection for multiple classifiers
13.5.5 Comparison of feature selection methods
13.6 Evaluation of text classification
13.7 References and further reading

14 Vector space classification
14.1 Document representations and measures of relatedness in vector spaces
14.2 Rocchio classification
14.3 k nearest neighbor
14.3.1 Time complexity and optimality of kNN
14.4 Linear versus nonlinear classifiers
14.5 Classification with more than two classes
14.6 The bias-variance tradeoff
14.7 References and further reading
14.8 Exercises

15 Support vector machines and machine learning on documents
15.1 Support vector machines: The linearly separable case
15.2 Extensions to the SVM model
15.2.1 Soft margin classification
15.2.2 Multiclass SVMs
15.2.3 Nonlinear SVMs
15.2.4 Experimental results
15.3 Issues in the classification of text documents
15.3.1 Choosing what kind of classifier to use
15.3.2 Improving classifier performance
15.4 Machine learning methods in ad hoc information retrieval
15.4.1 A simple example of machine-learned scoring
15.4.2 Result ranking by machine learning
15.5 References and further reading

16 Flat clustering
16.1 Clustering in information retrieval
16.2 Problem statement
16.2.1 Cardinality – the number of clusters
16.3 Evaluation of clustering
16.4 K-means
16.4.1 Cluster cardinality in K-means
16.5 Model-based clustering
16.6 References and further reading
16.7 Exercises

17 Hierarchical clustering
17.1 Hierarchical agglomerative clustering
17.2 Single-link and complete-link clustering
17.2.1 Time complexity of HAC
17.3 Group-average agglomerative clustering
17.4 Centroid clustering
17.5 Optimality of HAC
17.6 Divisive clustering
17.7 Cluster labeling
17.8 Implementation notes
17.9 References and further reading
17.10 Exercises

18 Matrix decompositions and latent semantic indexing
18.1 Linear algebra review
18.1.1 Matrix decompositions
18.2 Term-document matrices and singular value decompositions
18.3 Low-rank approximations
18.4 Latent semantic indexing
18.5 References and further reading

19 Web search basics
19.1 Background and history
19.2 Web characteristics
19.2.1 The web graph
19.2.2 Spam
19.3 Advertising as the economic model
19.4 The search user experience
19.4.1 User query needs
19.5 Index size and estimation
19.6 Near-duplicates and shingling
19.7 References and further reading

20 Web crawling and indexes
20.1 Overview
20.1.1 Features a crawler must provide
20.1.2 Features a crawler should provide
20.2 Crawling
20.2.1 Crawler architecture
20.2.2 DNS resolution
20.2.3 The URL frontier
20.3 Distributing indexes
20.4 Connectivity servers
20.5 References and further reading

21 Link analysis
21.1 The Web as a graph
21.1.1 Anchor text and the web graph
21.2 PageRank
21.2.1 Markov chains
21.2.2 The PageRank computation
21.2.3 Topic-specific PageRank
21.3 Hubs and Authorities
21.3.1 Choosing the subset of the Web
21.4 References and further reading

Bibliography
Author Index

File details

File type: PDF file
Size: 6.58 MB
Material: An introduction to information retrieval. Manning_ Raghavan (2009).pdf
Material type: Book
Subject: Text Data Analysis and Information Retrieval
University: Lomonosov Moscow State University (MSU)

Files in this book

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf
