For instance, we do not distinguish names of authors and names of corporations if both have the parent node name. But most important distinctions, like the example contrast author#"Gates" vs. section#"Gates", will be respected.

In many cases, several different XML schemas occur in a collection since the XML documents in an IR application often come from more than one source. This phenomenon is called schema heterogeneity or schema diversity and presents yet another challenge. As illustrated in Figure 10.6, comparable elements may have different names: creator in d2 vs.
author in d3. In other cases, the structural organization of the schemas may be different: Author names are direct descendants of the node author in q3, but there are the intervening nodes firstname and lastname in d3.
If we employ strict matching of trees, then q3 will retrieve neither d2 nor d3 although both documents are relevant. Some form of approximate matching of element names in combination with semi-automatic matching of different document structures can help here. Human editing of correspondences of elements in different schemas will usually do better than automatic methods.

Schema heterogeneity is one reason for query-document mismatches like q3/d2 and q3/d3. Another reason is that, as mentioned, users often are not familiar with the element names and the structure of the schemas of the collections they search. This poses a challenge for interface design in XML retrieval. Ideally, the user interface should expose the tree structure of the collection and allow users to specify the elements they are querying.
If we take this approach, then designing the query interface in structured retrieval is more complex than a search box for keyword queries in unstructured retrieval. We can also support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed.
We call such queries extended queries. The tree in Figure 10.3 and q4 in Figure 10.6 are examples of extended queries. We show edges that are interpreted as descendant relationships as dashed arrows. In q4, a dashed arrow connects book and Gates. As a pseudo-XPath notation for q4, we adopt book//#"Gates": a book that somewhere in its structure contains the word Gates, where the path from the book node to Gates can be arbitrarily long. The pseudo-XPath notation for the extended query that in addition specifies that Gates occurs in a section of the book is book//section//#"Gates".

It is convenient for users to be able to issue such extended queries without having to specify the exact structural configuration in which a query term should occur, either because they do not care about the exact configuration or because they do not know enough about the schema of the collection to be able to specify it.
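To make the descendant interpretation concrete, the following minimal Python sketch (not from the book) checks whether a root-to-leaf document path satisfies an extended query path such as book//section//#"Gates": every query node must occur along the document path in order, with any number of intervening nodes allowed. The function name and the example paths are illustrative assumptions, not taken from the book's figures.

```python
def matches_extended(query_path, doc_path):
    """True if query_path occurs as an in-order subsequence of doc_path,
    i.e. every parent-child edge of the query is treated as a descendant
    edge with any number of intervening nodes allowed."""
    it = iter(doc_path)
    return all(node in it for node in query_path)

# Extended query book//section//#"Gates" against two hypothetical document paths:
query = ["book", "section", "Gates"]
d_strict = ["book", "section", "Gates"]                              # direct nesting
d_deep = ["book", "part", "chapter", "section", "paragraph", "Gates"]

print(matches_extended(query, d_strict))                  # True
print(matches_extended(query, d_deep))                    # True: intervening nodes allowed
print(matches_extended(["book", "title", "Gates"], d_deep))  # False: no title node on the path
```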
In Figure 10.7, the user is looking for a chapter entitled FFT (q5). Suppose there is no such chapter in the collection, but that there are references to books on FFT (d4). A reference to a book on FFT is not exactly what the user is looking for, but it is better than returning nothing. Extended queries do not help here. The extended query q6 also returns nothing. This is a case where we may want to interpret the structural constraints specified in the query as hints rather than as strict conditions.
As we will discuss in Section 10.4, users prefer a relaxed interpretation of structural constraints: Elements that do not meet structural constraints perfectly should be ranked lower, but they should not be omitted from search results.

◮ Figure 10.7 A structural mismatch between two queries and a document.

10.3 A vector space model for XML retrieval

In this section, we present a simple vector space model for XML retrieval. It is not intended to be a complete description of a state-of-the-art system. Instead, we want to give the reader a flavor of how documents can be represented and retrieved in XML retrieval.

To take account of structure in retrieval in Figure 10.4, we want a book entitled Julius Caesar to be a match for q1 and no match (or a lower weighted match) for q2.
In unstructured retrieval, there would be a single dimension of the vector space for Caesar. In XML retrieval, we must separate the title word Caesar from the author name Caesar. One way of doing this is to have each dimension of the vector space encode a word together with its position within the XML tree.

Figure 10.8 illustrates this representation. We first take each text node (which in our setup is always a leaf) and break it into multiple nodes, one for each word. So the leaf node Bill Gates is split into two leaves Bill and Gates. Next we define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term.
A subset of these possible lexicalized subtrees is shown in the figure, but there are others – e.g., the subtree corresponding to the whole document with the leaf node Gates removed. We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them. This means that we can use the vector space formalism from Chapter 6 for XML retrieval. The main difference is that the dimensions of the vector space in unstructured retrieval are vocabulary terms whereas they are lexicalized subtrees in XML retrieval.

◮ Figure 10.8 A mapping of an XML document (left) to a set of lexicalized subtrees (right).

There is a tradeoff between the dimensionality of the space and the accuracy of query results. If we trivially restrict dimensions to vocabulary terms, then we have a standard vector space retrieval system that will retrieve many documents that do not match the structure of the query (e.g., Gates in the title as opposed to the author element).
If we create a separate dimension for each lexicalized subtree occurring in the collection, the dimensionality of the space becomes too large. A compromise is to index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs. We call such an XML-context/term pair a structural term and denote it by ⟨c, t⟩: a pair of XML context c and vocabulary term t. The document in Figure 10.8 has nine structural terms. Seven are shown (e.g., "Bill" and Author#"Bill") and two are not shown: /Book/Author#"Bill" and /Book/Author#"Gates". The tree with the leaves Bill and Gates is a lexicalized subtree that is not a structural term. We use the previously introduced pseudo-XPath notation for structural terms.
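To make this concrete, here is a minimal Python sketch (not from the book) that enumerates the structural terms of a small XML document, assuming that the XML contexts of a word are the suffixes of its root-to-leaf element path, including the empty context. This assumption matches the three structural terms listed above for Bill: "Bill", Author#"Bill" and /Book/Author#"Bill".

```python
import xml.etree.ElementTree as ET

def structural_terms(xml_string):
    """Enumerate all structural terms <c, t>: each word of each text node is
    paired with every suffix of its path of enclosing elements, including
    the empty context (the bare vocabulary term)."""
    root = ET.fromstring(xml_string)
    terms = set()

    def walk(elem, path):
        path = path + [elem.tag]
        for word in (elem.text or "").split():       # split the text node into words
            for i in range(len(path) + 1):            # all path suffixes
                context = path[i:]
                if context:
                    prefix = "/" if i == 0 else ""    # leading / marks the full root path
                    terms.add(prefix + "/".join(context) + '#"' + word + '"')
                else:
                    terms.add('"' + word + '"')       # empty context: the term by itself
        for child in elem:
            walk(child, path)

    walk(root, [])
    return terms

# An Author element as in Figure 10.8: the leaf "Bill Gates" is split into the
# words Bill and Gates, each contributing three structural terms.
print(sorted(structural_terms("<Book><Author>Bill Gates</Author></Book>")))
# ['"Bill"', '"Gates"', '/Book/Author#"Bill"', '/Book/Author#"Gates"',
#  'Author#"Bill"', 'Author#"Gates"']
```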
As we discussed in the last section, users are bad at remembering details about the schema and at constructing queries that comply with the schema. We will therefore interpret all queries as extended queries – that is, there can be an arbitrary number of intervening nodes in the document for any parent-child node pair in the query. For example, we interpret q5 in Figure 10.7 as q6. But we still prefer documents that match the query structure closely by inserting fewer additional nodes.
We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

(10.1)
$$\mathrm{CR}(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{if } c_q \text{ does not match } c_d \end{cases}$$

where |cq| and |cd| are the number of nodes in the query path and document path, respectively, and cq matches cd iff we can transform cq into cd by inserting additional nodes. Two examples from Figure 10.6 are CR(cq4, cd2) = 3/4 = 0.75 and CR(cq4, cd3) = 3/5 = 0.6, where cq4, cd2 and cd3 are the relevant paths from top to leaf node in q4, d2 and d3, respectively. The value of CR(cq, cd) is 1.0 if q and d are identical.
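As an illustration, here is a minimal Python sketch (not from the book) of the context resemblance function of Equation (10.1). A path is represented as a list of node labels, and "cq matches cd" is implemented as an in-order subsequence test, following the definition that cq can be transformed into cd by inserting additional nodes. The intermediate node labels used for d2 and d3 are assumptions modeled on Figure 10.6.

```python
def cr(cq, cd):
    """Context resemblance CR(cq, cd) from Equation (10.1).

    cq and cd are paths given as lists of node labels. cq matches cd if cq
    can be transformed into cd by inserting additional nodes, i.e. cq
    occurs as an in-order subsequence of cd."""
    it = iter(cd)
    if all(node in it for node in cq):        # subsequence test
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

# Paths from top node to leaf, modeled on q4, d2 and d3 in Figure 10.6
# (the intermediate node labels are illustrative assumptions):
c_q4 = ["book", "Gates"]
c_d2 = ["book", "creator", "Gates"]
c_d3 = ["book", "author", "lastname", "Gates"]

print(cr(c_q4, c_d2))   # 0.75, as in the text: (1+2)/(1+3)
print(cr(c_q4, c_d3))   # 0.6,  as in the text: (1+2)/(1+4)
print(cr(c_q4, c_q4))   # 1.0 for identical paths
```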
The final score for a document is computed as a variant of the cosine measure (Equation (6.10), page 121), which we call SimNoMerge for reasons that will become clear shortly. SimNoMerge is defined as follows:

(10.2)
$$\mathrm{SimNoMerge}(q, d) = \sum_{c_k \in B} \sum_{c_l \in B} \mathrm{CR}(c_k, c_l) \sum_{t \in V} \mathrm{weight}(q, t, c_k)\, \frac{\mathrm{weight}(d, t, c_l)}{\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}}$$

where V is the vocabulary of non-structural terms; B is the set of all XML contexts; and weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively. We compute the weights using one of the weightings from Chapter 6, such as idft · wft,d.
The inverse document frequency idft depends on which elements we use to compute dft, as discussed in Section 10.2. The similarity measure SimNoMerge(q, d) is not a true cosine measure since its value can be larger than 1.0 (Exercise 10.11). We divide by $\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}$ to normalize for document length (Section 6.3.1, page 121). We have omitted query length normalization to simplify the formula. It has no effect on ranking since, for a given query, the normalizer $\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(q, t, c)}$ is the same for all documents.

The algorithm for computing SimNoMerge for all documents in the collection is shown in Figure 10.9.
The array normalizer in Figure 10.9 contains $\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}$ from Equation (10.2) for each document. We give an example of how SimNoMerge computes query-document similarities in Figure 10.10. ⟨c1, t⟩ is one of the structural terms in the query. We successively retrieve all postings lists for structural terms ⟨c′, t⟩ with the same vocabulary term t.
Three example postings lists are shown. For the first one, we have CR(c1, c1) = 1.0 since the two contexts are identical. The next context has no context resemblance with c1: CR(c1, c2) = 0 and the corresponding postings list is ignored.

ScoreDocumentsWithSimNoMerge(q, B, V, N, normalizer)
 1  for n ← 1 to N
 2  do score[n] ← 0
 3  for each ⟨cq, t⟩ ∈ q
 4  do wq ← Weight(q, t, cq)
 5     for each c ∈ B
 6     do if CR(cq, c) > 0
 7        then postings ← GetPostings(⟨c, t⟩)
 8             for each posting ∈ postings
 9             do x ← CR(cq, c) ∗ wq ∗ weight(posting)
10                score[docID(posting)] += x
11  for n ← 1 to N
12  do score[n] ← score[n]/normalizer[n]
13  return score

◮ Figure 10.9 The algorithm for scoring documents with SimNoMerge.

query: ⟨c1, t⟩

inverted index:
CR(c1, c1) = 1.0   ⟨c1, t⟩ → ⟨d1, 0.5⟩  ⟨d4, 0.1⟩  ⟨d9, 0.2⟩  ...
CR(c1, c2) = 0     ⟨c2, t⟩ → ⟨d2, 0.25⟩ ⟨d3, 0.1⟩  ⟨d12, 0.9⟩ ...
CR(c1, c3) = 0.63  ⟨c3, t⟩ → ⟨d3, 0.7⟩  ⟨d6, 0.8⟩  ⟨d9, 0.6⟩  ...

◮ Figure 10.10 Scoring of a query with one structural term in SimNoMerge.
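Continuing the example, the following minimal Python sketch (not from the book) runs the scoring loop of Figure 10.9 for the single query structural term ⟨c1, t⟩, with context resemblance values and postings lists modeled on Figure 10.10. As in the text, the query weight is taken to be 1.0, and the normalizer is assumed to be 1.0 for every document so that the scores match the walk-through that follows; all variable names are illustrative.

```python
from collections import defaultdict

# Context resemblance values and postings lists modeled on Figure 10.10.
# Each postings entry is (docID, weight(d, t, c)).
cr_values = {"c1": 1.0, "c2": 0.0, "c3": 0.63}   # CR(c1, c) for each context c
postings = {
    "c1": [("d1", 0.5), ("d4", 0.1), ("d9", 0.2)],
    "c2": [("d2", 0.25), ("d3", 0.1), ("d12", 0.9)],
    "c3": [("d3", 0.7), ("d6", 0.8), ("d9", 0.6)],
}
w_q = 1.0                                        # query weight of <c1, t>, as in the text

score = defaultdict(float)
for c, resemblance in cr_values.items():
    if resemblance > 0:                          # postings lists with CR = 0 are skipped
        for doc_id, weight in postings[c]:
            score[doc_id] += resemblance * w_q * weight

# normalizer[d] is assumed to be 1.0 here, so no final division is needed.
print(max(score.items(), key=lambda kv: kv[1]))  # ('d9', ~0.578): 1.0*0.2 + 0.63*0.6
```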
The context match of c1 with c3 is 0.63 > 0 and it will be processed. In this example, the highest ranking document is d9 with a similarity of 1.0 × 0.2 + 0.63 × 0.6 = 0.578. To simplify the figure, the query weight of ⟨c1, t⟩ is assumed to be 1.0.

The query-document similarity function in Figure 10.9 is called SimNoMerge because different XML contexts are kept separate for the purpose of weighting. An alternative similarity function is SimMerge, which relaxes the matching conditions of query and document further in the following three ways.

• We collect the statistics used for computing weight(q, t, c) and weight(d, t, c) from all contexts that have a non-zero resemblance to c (as opposed to just from c as in SimNoMerge). For instance, for computing the document frequency of the structural term atl#"recognition", we also count occurrences of recognition in XML contexts fm/atl, article//atl, etc.

• We modify Equation (10.2) by merging all structural terms in the document that have a non-zero context resemblance to a given query structural term.