An introduction to information retrieval. Manning, Raghavan (2009), page 2
Text from the file (page 2)
DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.
Online edition (c) 2009 Cambridge UP

List of Tables

4.1 Typical system parameters in 2007. The seek time is the time needed to position the disk head in a new position. The transfer time per byte is the rate of transfer from disk to memory when the head is in the right position.
4.2 Collection statistics for Reuters-RCV1. Values are rounded for the computations in this book. The unrounded values are: 806,791 documents, 222 tokens per document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation, 4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and 96,969,056 tokens. The numbers in this table correspond to the third line ("case folding") in Table 5.1 (page 87).
4.3 The five steps in constructing an index for Reuters-RCV1 in blocked sort-based indexing. Line numbers refer to Figure 4.2.
4.4 Collection statistics for a large collection.
5.1 The effect of preprocessing on the number of terms, nonpositional postings, and tokens for Reuters-RCV1. "∆%" indicates the reduction in size from the previous line, except that "30 stop words" and "150 stop words" both use "case folding" as their reference line. "T%" is the cumulative ("total") reduction from unfiltered. We performed stemming with the Porter stemmer (Chapter 2, page 33).
5.2 Dictionary compression for Reuters-RCV1.
5.3 Encoding gaps instead of document IDs. For example, we store gaps 107, 5, 43, . . . , instead of docIDs 283154, 283159, 283202, . . . for computer. The first docID is left unchanged (only shown for arachnocentric).
5.4 VB encoding.
5.5 Some examples of unary and γ codes. Unary codes are only shown for the smaller numbers. Commas in γ codes are for readability only and are not part of the actual codes.
5.6 Index and dictionary compression for Reuters-RCV1. The compression ratio depends on the proportion of actual text in the collection. Reuters-RCV1 contains a large amount of XML markup. Using the two best compression schemes, γ encoding and blocking with front coding, the ratio of compressed index to collection size is therefore especially small for Reuters-RCV1: (101 + 5.9)/3600 ≈ 0.03.
5.7 Two gap sequences to be merged in blocked sort-based indexing.
6.1 Cosine computation for Exercise 6.19.
8.1 Calculation of 11-point Interpolated Average Precision.
8.2 Calculating the kappa statistic.
10.1 RDB (relational database) search, unstructured information retrieval and structured information retrieval.
10.2 INEX 2002 collection statistics.
10.3 INEX 2002 results of the vector space model in Section 10.3 for content-and-structure (CAS) queries and the quantization function Q.
10.4 A comparison of content-only and full-structure search in INEX 2003/2004.
13.1 Data for parameter estimation examples.
13.2 Training and test times for NB.
13.3 Multinomial versus Bernoulli model.
13.4 Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation.
13.5 A set of documents for which the NB independence assumptions are problematic.
13.6 Critical values of the χ² distribution with one degree of freedom. For example, if the two events are independent, then P(X² > 6.63) < 0.01. So for X² > 6.63 the assumption of independence can be rejected with 99% confidence.
13.7 The ten largest classes in the Reuters-21578 collection with number of documents in training and test sets.
13.8 Macro- and microaveraging. "Truth" is the true class and "call" the decision of the classifier. In this example, macroaveraged precision is [10/(10 + 10) + 90/(10 + 90)]/2 = (0.5 + 0.9)/2 = 0.7. Microaveraged precision is 100/(100 + 20) ≈ 0.83.
13.9 Text classification effectiveness numbers on Reuters-21578 for F1 (in percent). Results from Li and Yang (2003) (a), Joachims (1998) (b: kNN) and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM).
13.10 Data for parameter estimation exercise.
14.1 Vectors and class centroids for the data in Table 13.1.
14.2 Training and test times for Rocchio classification.
14.3 Training and test times for kNN classification.
14.4 A linear classifier.
14.5 A confusion matrix for Reuters-21578.
15.1 Training and testing complexity of various classifiers including SVMs.
15.2 SVM classifier break-even F1 from (Joachims 2002a, p. 114).
15.3 Training examples for machine-learned scoring.
16.1 Some applications of clustering in information retrieval.
16.2 The four external evaluation measures applied to the clustering in Figure 16.4.
16.3 The EM clustering algorithm.
17.1 Comparison of HAC algorithms.
17.2 Automatically computed cluster labels.

List of Figures

1.1 A term-document incidence matrix.
1.2 Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
1.3 The two parts of an inverted index.
1.4 Building an index by sorting and grouping.
1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
1.6 Algorithm for the intersection of two postings lists p1 and p2.
1.7 Algorithm for conjunctive queries that returns the set of documents containing each term in the input list of terms.
2.1 An example of a vocalized Modern Standard Arabic word.
2.2 The conceptual linear order of characters is not necessarily the order that you see on the page.
2.3 The standard unsegmented form of Chinese text using the simplified characters of mainland China.
2.4 Ambiguities in Chinese word segmentation.
2.5 A stop list of 25 semantically non-selective words which are common in Reuters-RCV1.
2.6 An example of how asymmetric expansion of query terms can usefully model users' expectations.
2.7 Japanese makes use of multiple intermingled writing systems and, like Chinese, does not segment words.
2.8 A comparison of three stemming algorithms on a sample text.
2.9 Postings lists with skip pointers.
2.10 Postings lists intersection with skip pointers.
2.11 Positional index example.
2.12 An algorithm for proximity intersection of postings lists p1 and p2.
3.1 A binary search tree.
3.2 A B-tree.
3.3 A portion of a permuterm index.
3.4 Example of a postings list in a 3-gram index.
3.5 Dynamic programming algorithm for computing the edit distance between strings s1 and s2.
3.6 Example Levenshtein distance computation.
3.7 Matching at least two of the three 2-grams in the query bord.
4.1 Document from the Reuters newswire.
4.2 Blocked sort-based indexing.
4.3 Merging in blocked sort-based indexing.
4.4 Inversion of a block in single-pass in-memory indexing.
4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004).
4.6 Map and reduce functions in MapReduce.
4.7 Logarithmic merging. Each token (termID, docID) is initially added to in-memory index Z0 by LMergeAddToken. LogarithmicMerge initializes Z0 and indexes.
4.8 A user-document matrix for access control lists. Element (i, j) is 1 if user i has access to document j and 0 otherwise. During query processing, a user's access postings list is intersected with the results list returned by the text part of the index.
5.1 Heaps' law.
5.2 Zipf's law for Reuters-RCV1.
5.3 Storing the dictionary as an array of fixed-width entries.
5.4 Dictionary-as-a-string storage.
5.5 Blocked storage with four terms per block.
5.6 Search of the uncompressed dictionary (a) and a dictionary compressed by blocking with k = 4 (b).
5.7 Front coding.
5.8 VB encoding and decoding.
5.9 Entropy H(P) as a function of P(x1) for a sample space with two outcomes x1 and x2.
5.10 Stratification of terms for estimating the size of a γ encoded inverted index.
6.1 Parametric search.
6.2 Basic zone index.
6.3 Zone index in which the zone is encoded in the postings rather than the dictionary.
6.4 Algorithm for computing the weighted zone score from two postings lists.
6.5 An illustration of training examples.
6.6 The four possible combinations of sT and sB.
6.7 Collection frequency (cf) and document frequency (df) behave differently, as in this example from the Reuters collection.
6.8 Example of idf values.
6.9 Table of tf values for Exercise 6.10.
6.10 Cosine similarity illustrated.
6.11 Euclidean normalized tf values for documents in Figure 6.9.
6.12 Term frequencies in three novels.
6.13 Term vectors for the three novels of Figure 6.12.
6.14 The basic algorithm for computing vector space scores.
6.15 SMART notation for tf-idf variants.
6.16 Pivoted document length normalization.
6.17 Implementing pivoted document length normalization by linear scaling.
7.1 A faster algorithm for vector space scores.
7.2 A static quality-ordered index.
7.3 Cluster pruning.
7.4 Tiered indexes.
7.5 A complete search system.
8.1 Graph comparing the harmonic mean to other means.
8.2 Precision/recall graph.
8.3 Averaged 11-point precision/recall graph across 50 queries for a representative TREC system.
8.4 The ROC curve corresponding to the precision-recall curve in Figure 8.2.
8.5 An example of selecting text for a dynamic snippet.
9.1 Relevance feedback searching over images.
9.2 Example of relevance feedback on a text collection.
9.3 The Rocchio optimal query for separating relevant and nonrelevant documents.
9.4 An application of Rocchio's algorithm.
9.5 Results showing pseudo relevance feedback greatly improving performance.
9.6 An example of query expansion in the interface of the Yahoo! web search engine in 2006.
9.7 Examples of query expansion via the PubMed thesaurus.
9.8 An example of an automatically generated thesaurus.
10.1 An XML document.
10.2 The XML document in Figure 10.1 as a simplified DOM object.
10.3 An XML query in NEXI format and its partial representation as a tree.
10.4 Tree representation of XML documents and queries.
10.5 Partitioning an XML document into non-overlapping indexing units.
10.6 Schema heterogeneity: intervening nodes and mismatched names.
10.7 A structural mismatch between two queries and a document.
10.8 A mapping of an XML document (left) to a set of lexicalized subtrees (right).
10.9 The algorithm for scoring documents with SimNoMerge.
10.10 Scoring of a query with one structural term in SimNoMerge.
10.11 Simplified schema of the documents in the INEX collection.
11.1 A tree of dependencies between terms.
12.1 A simple finite automaton and some of the strings in the language it generates.
12.2 A one-state finite automaton that acts as a unigram language model.
12.3 Partial specification of two unigram language models.
12.4 Results of a comparison of tf-idf with language modeling (LM) term weighting by Ponte and Croft (1998).
12.5 Three ways of developing the language modeling approach: (a) query likelihood, (b) document likelihood, and (c) model comparison.
13.1 Classes, training set, and test set in text classification.
13.2 Naive Bayes algorithm (multinomial model): Training and testing.
13.3 NB algorithm (Bernoulli model): Training and testing.
13.4 The multinomial NB model.
13.5 The Bernoulli NB model.
13.6 Basic feature selection algorithm for selecting the k best features.
13.7 Features with high mutual information scores for six Reuters-RCV1 classes.
13.8 Effect of feature set size on accuracy for multinomial and Bernoulli models.
13.9 A sample document from the Reuters-21578 collection.
14.1 Vector space classification into three classes.
14.2 Projections of small areas of the unit sphere preserve distances.
14.3 Rocchio classification.
14.4 Rocchio classification: Training and testing.
14.5 The multimodal class "a" consists of two different clusters (small upper circles centered on X's).
14.6 Voronoi tessellation and decision boundaries (double lines) in 1NN classification.
14.7 kNN training (with preprocessing) and testing.
14.8 There are an infinite number of hyperplanes that separate two linearly separable classes.
14.9 Linear classification algorithm.
14.10 A linear problem with noise.
14.11 A nonlinear problem.
14.12 J hyperplanes do not divide space into J disjoint regions.
14.13 Arithmetic transformations for the bias-variance decomposition.
14.14 Example for differences between Euclidean distance, dot product similarity and cosine similarity.
14.15 A simple non-separable set of points.
15.1 The support vectors are the 5 points right up against the margin of the classifier.
15.2 An intuition for large-margin classification.
15.3 The geometric margin of a point (r) and a decision boundary (ρ).
15.4 A tiny 3 data point training set for an SVM.
15.5 Large margin classification with slack variables.
15.6 Projecting data that is not linearly separable into a higher dimensional space can make it linearly separable.
15.7 A collection of training examples.
16.1 An example of a data set with a clear cluster structure.
16.2 Clustering of search results to improve recall.
16.3 An example of a user session in Scatter-Gather.
16.4 Purity as an external evaluation criterion for cluster quality.
16.5 The K-means algorithm.
16.6 A K-means example for K = 2 in R2.
16.7 The outcome of clustering in K-means depends on the initial seeds.
16.8 Estimated minimal residual sum of squares as a function of the number of clusters in K-means.
17.1 A dendrogram of a single-link clustering of 30 documents from Reuters-RCV1.
17.2 A simple, but inefficient HAC algorithm.
17.3 The different notions of cluster similarity used by the four HAC algorithms.
17.4 A single-link (left) and complete-link (right) clustering of eight documents.
17.5 A dendrogram of a complete-link clustering.
17.6 Chaining in single-link clustering.
17.7 Outliers in complete-link clustering.
17.8 The priority-queue algorithm for HAC.
17.9 Single-link clustering algorithm using an NBM array.
17.10 Complete-link clustering is not best-merge persistent.
17.11 Three iterations of centroid clustering.
17.12 Centroid clustering is not monotonic.
18.1 Illustration of the singular-value decomposition.
18.2 Illustration of low rank approximation using the singular-value decomposition.
18.3 The documents of Example 18.4 reduced to two dimensions in (V′)T.
18.4 Documents for Exercise 18.11.
18.5 Glossary for Exercise 18.11.
19.1 A dynamically generated web page.
19.2 Two nodes of the web graph joined by a link.
19.3 A sample small web graph.
19.4 The bowtie structure of the Web.
19.5 Cloaking as used by spammers.
19.6 Search advertising triggered by query keywords.
19.7 The various components of a web search engine.
19.8 Illustration of shingle sketches.
19.9 Two sets Sj1 and Sj2; their Jaccard coefficient is 2/5.
20.1 The basic crawler architecture.
20.2 Distributing the basic crawl architecture.
20.3 The URL frontier.
20.4 Example of an auxiliary hosts-to-back queues table.
20.5 A lexicographically ordered set of URLs.
20.6 A four-row segment of the table of links.
21.1 The random surfer at node A proceeds with probability 1/3 to each of B, C and D.
21.2 A simple Markov chain with three states; the numbers on the links indicate the transition probabilities.
21.3 The sequence of probability vectors.
21.4 A small web graph.
21.5 Topic-specific PageRank.
21.6 A sample run of HITS on the query japan elementary schools.
21.7 Web graph for Exercise 21.22.