An Introduction to Information Retrieval. Manning, Raghavan (2009)
These results demonstrate the benefits of structured retrieval. Structured retrieval imposes additional constraints on what to return and documents that pass the structural filter are more likely to be relevant. Recall may suffer because some relevant documents will be filtered out, but for precision-oriented tasks structured retrieval is superior.

10.5 Text-centric vs. data-centric XML retrieval

In the type of structured retrieval we cover in this chapter, XML structure serves as a framework within which we match the text of the query with the text of the XML documents. This exemplifies a system that is optimized for text-centric XML. While both text and structure are important, we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by (i) long text fields (e.g., sections of a document), (ii) inexact matching, and (iii) relevance-ranked results. Relational databases do not deal well with this use case.

In contrast, data-centric XML mainly encodes numerical and non-text attribute-value data. When querying data-centric XML, we want to impose exact match conditions in most cases.
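The contrast drawn above, ranked and inexact matching on long text fields versus exact match conditions on attribute-value data, can be illustrated with a small sketch. It is not the retrieval method of this chapter: the two sample documents, the function names `rank_elements` and `exact_match`, and the crude scoring rule (fraction of an element's tokens that are query terms) are all assumptions made for illustration, using only Python's standard `xml.etree.ElementTree` module.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical text-centric document: long text fields inside structural elements.
TEXT_XML = """
<book>
  <section><title>Heaps</title>
    <p>A binomial heap supports fast merging of heaps.</p></section>
  <section><title>Sorting</title>
    <p>Merge sort splits the input and merges sorted runs.</p></section>
  <section><title>Graphs</title>
    <p>Breadth-first search explores a graph level by level.</p></section>
</book>
"""

# Hypothetical data-centric document: short attribute-value data.
DATA_XML = """
<staff>
  <employee id="e1"><name>Ada</name><salary>5000</salary></employee>
  <employee id="e2"><name>Bob</name><salary>4200</salary></employee>
</staff>
"""


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_elements(root, tag, query):
    """Text-centric: rank all <tag> elements by a crude relevance score.

    The structural constraint is the element name; the score is the
    fraction of the element's tokens that are query terms (an assumed,
    deliberately simple stand-in for a real weighting scheme).
    """
    query_terms = set(tokenize(query))
    ranked = []
    for elem in root.iter(tag):
        tokens = tokenize("".join(elem.itertext()))
        if not tokens:
            continue
        score = sum(t in query_terms for t in tokens) / len(tokens)
        if score > 0:  # inexact matching: any query term is enough
            ranked.append((score, elem.findtext("title")))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


def exact_match(root, tag, field, value):
    """Data-centric: return ids of <tag> elements whose <field> equals value."""
    return [e.get("id") for e in root.iter(tag) if e.findtext(field) == value]
```

With the query `merge heaps`, `rank_elements(ET.fromstring(TEXT_XML), "section", "merge heaps")` returns two sections in score order even though neither contains both query terms, whereas `exact_match(ET.fromstring(DATA_XML), "employee", "salary", "5000")` returns an unordered set of exact matches with no scoring involved.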
Exact matching puts the emphasis on the structural aspects of XML documents and queries. An example is:

Find employees whose salary is the same this month as it was 12 months ago.

This query requires no ranking. It is purely structural and an exact matching of the salaries in the two time periods is probably sufficient to meet the user’s information need.

Online edition (c) 2009 Cambridge UP

Text-centric approaches are appropriate for data that are essentially text documents, marked up as XML to capture document structure. This is becoming a de facto standard for publishing text databases since most text documents have some form of interesting structure – paragraphs, sections, footnotes etc.
Examples include assembly manuals, issues of journals, Shakespeare’s collected works and newswire articles.

Data-centric approaches are commonly used for data collections with complex structures that mainly contain non-text data. A text-centric retrieval engine will have a hard time with proteomic data in bioinformatics or with the representation of a city map that (together with street names and other textual descriptions) forms a navigational database.

Two other types of queries that are difficult to handle in a text-centric structured retrieval model are joins and ordering constraints.
The query for employees with unchanged salary requires a join. The following query imposes an ordering constraint:

Retrieve the chapter of the book Introduction to algorithms that follows the chapter Binomial heaps.

This query relies on the ordering of elements in XML – in this case the ordering of chapter elements underneath the book node. There are powerful query languages for XML that can handle numerical attributes, joins and ordering constraints. The best known of these is XQuery, a language proposed for standardization by the W3C.
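To make the two query types above concrete, here is a minimal sketch in Python (not XQuery) using the standard `xml.etree.ElementTree` module. The XML layouts, attribute names, sample values, and the helper names `unchanged_salary` and `chapter_after` are hypothetical, chosen only to show what a join and an ordering constraint ask of the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical payroll data: one <payment> per employee per month.
PAYROLL = """
<payroll>
  <payment emp="ada" month="2008-03" salary="5000"/>
  <payment emp="bob" month="2008-03" salary="4200"/>
  <payment emp="ada" month="2009-03" salary="5000"/>
  <payment emp="bob" month="2009-03" salary="4500"/>
</payroll>
"""

# Hypothetical book structure: chapter order is carried by document order.
BOOK = """
<book title="Introduction to algorithms">
  <chapter title="B-trees"/>
  <chapter title="Binomial heaps"/>
  <chapter title="Fibonacci heaps"/>
</book>
"""


def unchanged_salary(root, this_month, earlier_month):
    """Join payments with themselves on the employee id.

    Returns the employees whose salary in this_month exactly equals
    their salary in earlier_month. No ranking is involved.
    """
    earlier = {
        p.get("emp"): p.get("salary")
        for p in root.iter("payment")
        if p.get("month") == earlier_month
    }
    return [
        p.get("emp")
        for p in root.iter("payment")
        if p.get("month") == this_month
        and p.get("emp") in earlier
        and earlier[p.get("emp")] == p.get("salary")
    ]


def chapter_after(book, title):
    """Return the title of the chapter that directly follows the named one.

    Relies on findall() returning child elements in document order.
    """
    chapters = book.findall("chapter")
    for current, following in zip(chapters, chapters[1:]):
        if current.get("title") == title:
            return following.get("title")
    return None  # chapter not found, or it is the last one
```

Both functions depend on exact values and on element order rather than on any notion of relevance, which is why a term-weighting retrieval model has little to offer here. A declarative query language such as XQuery can express both constraints directly, without hand-written loops.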
XQuery is designed to be broadly applicable in all areas where XML is used. Due to its complexity, it is challenging to implement an XQuery-based ranked retrieval system with the performance characteristics that users have come to expect in information retrieval. This is currently one of the most active areas of research in XML retrieval.

Relational databases are better equipped to handle many structural constraints, particularly joins (but ordering is also difficult in a database framework – the tuples of a relation in the relational calculus are not ordered). For this reason, most data-centric XML retrieval systems are extensions of relational databases (see the references in Section 10.6). If text fields are short, exact matching meets user needs and retrieval results in the form of unordered sets are acceptable, then using a relational database for XML retrieval is appropriate.

10.6 References and further reading

There are many good introductions to XML, including (Harold and Means 2004).
Table 10.1 is inspired by a similar table in (van Rijsbergen 1979). Section 10.4 follows the overview of INEX 2002 by Gövert and Kazai (2003), published in the proceedings of the meeting (Fuhr et al. 2003a). The proceedings of the four following INEX meetings were published as Fuhr et al. (2003b), Fuhr et al. (2005), Fuhr et al. (2006), and Fuhr et al. (2007). An up-to-date overview article is Fuhr and Lalmas (2007). The results in Table 10.4 are from (Kamps et al. 2006). Chu-Carroll et al. (2006) also present evidence that XML queries increase precision compared with unstructured queries. Instead of coverage and relevance, INEX now evaluates on the related but different dimensions of exhaustivity and specificity (Lalmas and Tombros 2007). Trotman et al. (2006) relate the tasks investigated at INEX to real-world uses of structured retrieval such as structured book search on internet bookstore sites.

The structured document retrieval principle is due to Chiaramella et al. (1996). Figure 10.5 is from (Fuhr and Großjohann 2004).
Rahm and Bernstein (2001) give a survey of automatic schema matching that is applicable to XML. The vector-space based XML retrieval method in Section 10.3 is essentially IBM Haifa’s JuruXML system as presented by Mass et al. (2003) and Carmel et al. (2003). Schlieder and Meuss (2002) and Grabs and Schek (2002) describe similar approaches. Carmel et al. (2003) represent queries as XML fragments. The trees that represent XML queries in this chapter are all XML fragments, but XML fragments also permit the operators +, − and phrase on content nodes.

We chose to present the vector space model for XML retrieval because it is simple and a natural extension of the unstructured vector space model in Chapter 6.
But many other unstructured retrieval methods have been applied to XML retrieval with at least as much success as the vector space model. These methods include language models (cf. Chapter 12, e.g., Kamps et al. (2004), List et al. (2005), Ogilvie and Callan (2005)), systems that use a relational database as a backend (Mihajlović et al. 2005, Theobald et al. 2005; 2008), probabilistic weighting (Lu et al. 2007), and fusion (Larson 2005). There is currently no consensus as to what the best approach to XML retrieval is.

Most early work on XML retrieval accomplished relevance ranking by focusing on individual terms, including their structural contexts, in query and document.
As in unstructured information retrieval, there is a trend in more recent work to model relevance ranking as combining evidence from disparate measurements about the query, the document and their match. The combination function can be tuned manually (Arvola et al. 2005, Sigurbjörnsson et al. 2004) or trained using machine learning methods (Vittaut and Gallinari (2006), cf. Section 15.4.1, page 341).

An active area of XML retrieval research is focused retrieval (Trotman et al. 2007), which aims to avoid returning nested elements that share one or more common subelements (cf. discussion in Section 10.2, page 203).
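The goal of focused retrieval described above, a result list without nested elements, can be sketched as a simple post-processing step on a ranked list. This is one naive strategy under stated assumptions, not the method of any of the cited papers: elements are identified by hypothetical path strings such as `/book/chapter[2]/section[1]`, the input is assumed to be sorted by descending score, and the names `is_nested` and `remove_overlap` are made up for this example.

```python
def is_nested(path_a, path_b):
    """True if the two element paths are equal or one is an ancestor of the other."""
    return (
        path_a == path_b
        or path_a.startswith(path_b + "/")
        or path_b.startswith(path_a + "/")
    )


def remove_overlap(ranked):
    """Greedily keep the highest-scoring element of every nested group.

    ranked: list of (path, score) pairs, sorted by descending score.
    An element is dropped if it contains, or is contained in, an element
    that was already kept.
    """
    kept = []
    for path, score in ranked:
        if not any(is_nested(path, kept_path) for kept_path, _ in kept):
            kept.append((path, score))
    return kept


# Hypothetical ranked result list with overlapping elements.
results = [
    ("/book/chapter[2]/section[1]", 0.9),
    ("/book/chapter[2]", 0.7),                   # ancestor of the first hit
    ("/book/chapter[3]", 0.5),
    ("/book/chapter[2]/section[1]/p[2]", 0.4),   # descendant of the first hit
]
focused = remove_overlap(results)
```

Here `focused` keeps only `/book/chapter[2]/section[1]` and `/book/chapter[3]`. The greedy rule always favors the higher-scoring element; whether a user would rather see the enclosing chapter than the section is exactly the kind of question that overlap-aware evaluation measures try to capture.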
There is evidence that users dislike redundancy caused by nested elements (Betsi et al. 2006). Focused retrieval requires evaluation measures that penalize redundant results lists (Kazai and Lalmas 2006, Lalmas et al. 2007). Trotman and Geva (2006) argue that XML retrieval is a form of passage retrieval. In passage retrieval (Salton et al. 1993, Hearst and Plaunt 1993, Zobel et al. 1995, Hearst 1997, Kaszkiel and Zobel 1997), the retrieval system returns short passages instead of documents in response to a user query. While element boundaries in XML documents are cues for identifying good segment boundaries between passages, the most relevant passage often does not coincide with an XML element.

In the last several years, the query format at INEX has been the NEXI standard proposed by Trotman and Sigurbjörnsson (2004).
Figure 10.3 is from their paper. O’Keefe and Trotman (2004) give evidence that users cannot reliably distinguish the child and descendant axes. This justifies only permitting descendant axes in NEXI (and XML fragments). These structural constraints were only treated as “hints” in recent INEXes. Assessors can judge an element highly relevant even though it violates one of the structural constraints specified in a NEXI query.

An alternative to structured query languages like NEXI is a more sophisticated user interface for query formulation (Tannier and Geva 2005, van Zwol et al. 2006, Woodley and Geva 2006).

A broad overview of XML retrieval that covers database as well as IR approaches is given by Amer-Yahia and Lalmas (2006) and an extensive reference list on the topic can be found in (Amer-Yahia et al.