An Introduction to Information Retrieval. Manning, Raghavan (2009), page 23
This is fine for collections that change infrequently or never (e.g., the Bible or Shakespeare). But most collections are modified frequently, with documents being added, deleted, and updated. This means that new terms need to be added to the dictionary, and postings lists need to be updated for existing terms.

The simplest way to achieve this is to periodically reconstruct the index from scratch. This is a good solution if the number of changes over time is small and a delay in making new documents searchable is acceptable – and if enough resources are available to construct a new index while the old one is still available for querying.

If there is a requirement that new documents be included quickly, one solution is to maintain two indexes: a large main index and a small auxiliary index that stores new documents. The auxiliary index is kept in memory.
Searches are run across both indexes and results merged. Deletions are stored in an invalidation bit vector. We can then filter out deleted documents before returning the search result. Documents are updated by deleting and reinserting them.

Each time the auxiliary index becomes too large, we merge it into the main index. The cost of this merging operation depends on how we store the index in the file system. If we store each postings list as a separate file, then the merge simply consists of extending each postings list of the main index by the corresponding postings list of the auxiliary index.
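The scheme just described can be sketched in a few lines of Python. This is an illustrative toy, not the book's code: the class and method names are my own, postings lists are held in plain dicts, and the invalidation bit vector is modeled as a set of docIDs.

```python
class DynamicIndex:
    """Toy main/auxiliary index pair with an invalidation bit vector."""

    def __init__(self, merge_threshold=4):
        self.main = {}        # term -> sorted postings list (on "disk")
        self.aux = {}         # term -> postings list (in memory)
        self.deleted = set()  # invalidation bit vector, as a set of docIDs
        self.merge_threshold = merge_threshold

    def add_document(self, doc_id, terms):
        # New documents always go into the small in-memory auxiliary index.
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)
        if sum(len(p) for p in self.aux.values()) >= self.merge_threshold:
            self._merge()

    def delete_document(self, doc_id):
        # Deletions are only recorded; postings are filtered at query time.
        self.deleted.add(doc_id)

    def _merge(self):
        # With one file per postings list, merging just extends each
        # main-index postings list by the corresponding auxiliary list.
        for term, postings in self.aux.items():
            self.main.setdefault(term, []).extend(postings)
        self.aux = {}

    def search(self, term):
        # Run the query on both indexes, merge, then filter deleted docs.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)
```

An update is then simply `delete_document` followed by `add_document` with the new contents.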
In this scheme, the reason for keeping the auxiliary index is to reduce the number of disk seeks required over time. Updating each document separately requires up to M_ave disk seeks, where M_ave is the average size of the vocabulary of documents in the collection.
With an auxiliary index, we only put additional load on the disk when we merge auxiliary and main indexes. Unfortunately, the one-file-per-postings-list scheme is infeasible because most file systems cannot efficiently handle very large numbers of files. The simplest alternative is to store the index as one large file, that is, as a concatenation of all postings lists. In reality, we often choose a compromise between the two extremes (Section 4.7). To simplify the discussion, we choose the simple option of storing the index as one large file here.

Online edition (c) 2009 Cambridge UP

LMERGEADDTOKEN(indexes, Z0, token)
 1  Z0 ← MERGE(Z0, {token})
 2  if |Z0| = n
 3    then for i ← 0 to ∞
 4      do if Ii ∈ indexes
 5        then Zi+1 ← MERGE(Ii, Zi)
 6             (Zi+1 is a temporary index on disk.)
 7             indexes ← indexes − {Ii}
 8        else Ii ← Zi  (Zi becomes the permanent index Ii.)
 9             indexes ← indexes ∪ {Ii}
10             BREAK
11    Z0 ← ∅

LOGARITHMICMERGE()
1  Z0 ← ∅  (Z0 is the in-memory index.)
2  indexes ← ∅
3  while true
4  do LMERGEADDTOKEN(indexes, Z0, GETNEXTTOKEN())

◮ Figure 4.7  Logarithmic merging.
Each token (termID, docID) is initially added to the in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and indexes.

In this scheme, we process each posting ⌊T/n⌋ times because we touch it during each of ⌊T/n⌋ merges, where n is the size of the auxiliary index and T the total number of postings. Thus, the overall time complexity is Θ(T²/n). (We neglect the representation of terms here and consider only the docIDs. For the purpose of time complexity, a postings list is simply a list of docIDs.)

We can do better than Θ(T²/n) by introducing log₂(T/n) indexes I0, I1, I2, . . . of size 2⁰ × n, 2¹ × n, 2² × n, . . . . Postings percolate up this sequence of indexes and are processed only once on each level. This scheme is called logarithmic merging (Figure 4.7). As before, up to n postings are accumulated in an in-memory auxiliary index, which we call Z0. When the limit n is reached, the 2⁰ × n postings in Z0 are transferred to a new index I0 that is created on disk.
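The pseudocode of Figure 4.7 translates almost line for line into Python. In this sketch (illustrative, not the book's code), the on-disk indexes I_i are modeled as entries of a dict keyed by level, and MERGE as list concatenation:

```python
def merge(index1, index2):
    # Stand-in for an on-disk postings merge.
    return index1 + index2

class LogarithmicMerge:
    def __init__(self, n):
        self.n = n            # capacity of the in-memory index
        self.Z0 = []          # in-memory index Z0
        self.indexes = {}     # level i -> on-disk index I_i

    def add_token(self, token):
        """LMERGEADDTOKEN from Figure 4.7."""
        self.Z0 = merge(self.Z0, [token])
        if len(self.Z0) == self.n:
            Z, i = self.Z0, 0
            while True:
                if i in self.indexes:
                    # I_i exists: merge it in and percolate up a level.
                    Z = merge(self.indexes.pop(i), Z)
                    i += 1
                else:
                    # Z becomes the permanent index I_i.
                    self.indexes[i] = Z
                    break
            self.Z0 = []
```

With n = 2, feeding in six tokens leaves I1 (the first four postings) and a fresh I0 (the last two) on disk, matching the percolation described in the text.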
The next time Z0 is full, it is merged with I0 to create an index Z1 of size 2¹ × n. Then Z1 is either stored as I1 (if there isn't already an I1) or merged with I1 into Z2 (if I1 exists); and so on. We service search requests by querying in-memory Z0 and all currently valid indexes Ii on disk and merging the results. Readers familiar with the binomial heap data structure (see, for example, Cormen et al. 1990, Chapter 19) will recognize its similarity with the structure of the inverted indexes in logarithmic merging.

Overall index construction time is Θ(T log(T/n)) because each posting is processed only once on each of the log(T/n) levels. We trade this efficiency gain for a slowdown of query processing; we now need to merge results from log(T/n) indexes as opposed to just two (the main and auxiliary indexes). As in the auxiliary index scheme, we still need to merge very large indexes occasionally (which slows down the search system during the merge), but this happens less frequently and the indexes involved in a merge are on average smaller.

Having multiple indexes complicates the maintenance of collection-wide statistics. For example, it affects the spelling correction algorithm in Section 3.3 (page 56) that selects the corrected alternative with the most hits. With multiple indexes and an invalidation bit vector, the correct number of hits for a term is no longer a simple lookup.
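The query-time merge across Z0 and the valid on-disk indexes can be sketched as a k-way merge over docID-sorted postings lists, followed by filtering against the invalidation bit vector. Function and variable names here are illustrative:

```python
import heapq

def query(term, z0, disk_indexes, deleted):
    """Merge postings for `term` from every index, dropping deleted docs.

    z0 and each element of disk_indexes map term -> sorted docID list;
    deleted is the invalidation bit vector as a set of docIDs.
    """
    lists = [idx.get(term, []) for idx in [z0, *disk_indexes]]
    seen, result = set(), []
    for doc_id in heapq.merge(*lists):   # k-way merge of sorted lists
        if doc_id not in deleted and doc_id not in seen:
            seen.add(doc_id)
            result.append(doc_id)
    return result
```

With two indexes (the auxiliary/main scheme) the list of inputs has length two; with logarithmic merging it grows to log(T/n) + 1, which is the query-processing slowdown mentioned above.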
In fact, all aspects of an IR system – index maintenance, query processing, distribution, and so on – are more complex in logarithmic merging.

Because of this complexity of dynamic indexing, some large search engines adopt a reconstruction-from-scratch strategy. They do not construct indexes dynamically. Instead, a new index is built from scratch periodically. Query processing is then switched to the new index and the old index is deleted.

Exercise 4.4
For n = 2 and 1 ≤ T ≤ 30, perform a step-by-step simulation of the algorithm in Figure 4.7.
Create a table that shows, for each point in time at which T = 2k tokens have been processed (1 ≤ k ≤ 15), which of the indexes I0, . . . , I3 are in use. The first three lines of the table are given below.

       I3   I2   I1   I0
  2     0    0    0    1
  4     0    0    1    0
  6     0    0    1    1

4.6 Other types of indexes

This chapter only describes construction of nonpositional indexes. Except for the much larger data volume we need to accommodate, the main difference for positional indexes is that (termID, docID, (position1, position2, . . .)) triples, instead of (termID, docID) pairs, have to be processed and that tokens and postings contain positional information in addition to docIDs. With this change, the algorithms discussed here can all be applied to positional indexes.

◮ Figure 4.8  A user-document matrix for access control lists. Element (i, j) is 1 if user i has access to document j and 0 otherwise. During query processing, a user's access postings list is intersected with the results list returned by the text part of the index.

In the indexes we have considered so far, postings lists are ordered with respect to docID. As we see in Chapter 5, this is advantageous for compression – instead of docIDs we can compress smaller gaps between IDs, thus reducing space requirements for the index.
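The gap idea can be illustrated in a few lines; this is a minimal sketch (Chapter 5 covers the actual compression codes applied to the gaps):

```python
def to_gaps(postings):
    """Replace each docID (after the first) by its gap to the predecessor."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Recover the original docID-sorted postings list from its gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

The gaps are much smaller numbers than the raw docIDs in a large collection, so they can be stored in fewer bits.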
However, this structure for the index is not optimal when we build ranked (Chapters 6 and 7) – as opposed to Boolean – retrieval systems. In ranked retrieval, postings are often ordered according to weight or impact, with the highest-weighted postings occurring first. With this organization, scanning of long postings lists during query processing can usually be terminated early when weights have become so small that any further documents can be predicted to be of low similarity to the query (see Chapter 6). In a docID-sorted index, new documents are always inserted at the end of postings lists. In an impact-sorted index (Section 7.1.5, page 140), the insertion can occur anywhere, thus complicating the update of the inverted index.

Security is an important consideration for retrieval systems in corporations. A low-level employee should not be able to find the salary roster of the corporation, but authorized managers need to be able to search for it.
Users' results lists must not contain documents they are barred from opening; the very existence of a document can be sensitive information.

User authorization is often mediated through access control lists or ACLs. ACLs can be dealt with in an information retrieval system by representing each document as the set of users that can access it (Figure 4.8) and then inverting the resulting user-document matrix. The inverted ACL index has, for each user, a "postings list" of documents they can access – the user's access list. Search results are then intersected with this list.
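Inverting the user-document matrix of Figure 4.8 and intersecting an access list with text-search results can be sketched as follows (the function names are illustrative, not from the book):

```python
def invert_acl(matrix):
    """matrix[i][j] = 1 iff user i may read document j.

    Returns, for each user, a 'postings list' of readable docIDs.
    """
    return {user: [doc for doc, bit in enumerate(row) if bit]
            for user, row in enumerate(matrix)}

def filter_results(results, access_list):
    """Intersect text-search results with a user's access list."""
    allowed = set(access_list)
    return [doc for doc in results if doc in allowed]
```

The maintenance problem mentioned in the text shows up here directly: every permission change edits one of these per-user postings lists.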
However, such an index is difficult to maintain when access permissions change – we discussed these difficulties in the context of incremental indexing for regular postings lists in Section 4.5. It also requires the processing of very long postings lists for users with access to large document subsets. User membership is therefore often verified by retrieving access information directly from the file system at query time – even though this slows down retrieval.

◮ Table 4.3  The five steps in constructing an index for Reuters-RCV1 in blocked sort-based indexing. Line numbers refer to Figure 4.2.

       Step                                             Time
  1    reading of collection (line 4)
  2    10 initial sorts of 10⁷ records each (line 5)
  3    writing of 10 blocks (line 6)
  4    total disk transfer time for merging (line 7)
  5    time of actual merging (line 7)
       total

◮ Table 4.4  Collection statistics for a large collection.

  Symbol   Statistic               Value
  N        # documents             1,000,000,000
  Lave     # tokens per document   1000
  M        # distinct terms        44,000,000

We discussed indexes for storing and retrieving terms (as opposed to documents) in Chapter 3.

Exercise 4.5
Can spelling correction compromise document-level security? Consider the case where a spelling correction is based on documents to which the user does not have access.

Exercise 4.6
Total index construction time in blocked sort-based indexing is broken down in Table 4.3.