An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 48
Текст из файла (страница 48)
We call the search oversuch structured documents structured retrieval. Queries in structured retrievalcan be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applicationsof structured retrieval include digital libraries, patent databases, blogs, textin which entities like persons and locations have been tagged (in a processcalled named entity tagging) and output from office suites like OpenOfficethat save documents as marked up text. In all of these applications, we wantto be able to run queries that combine textual criteria with structural criteria.Examples of such queries are give me a full-length article on fast fourier transforms(digital libraries), give me patents whose claims mention RSA public key encryption1.
In most modern database systems, one can enable full-text search for text columns. Thisusually means that an inverted index is created and Boolean or vector space search enabled,effectively combining core database with information retrieval technologies.Online edition (c) 2009 Cambridge UP19610 XML retrievalobjectsmodelmain data structurequeriesRDB searchrecordsrelational modeltableSQLunstructured retrievalunstructured documentsvector space & othersinverted indexfree text queriesstructured retrievaltrees with text at leaves???◮ Table 10.1 RDB (relational database) search, unstructured information retrievaland structured information retrieval.
There is no consensus yet as to which methodswork best for structured retrieval although many researchers believe that XQuery(page 215) will become the standard for structured queries.and that cite US patent 4,405,829 (patents), or give me articles about sightseeingtours of the Vatican and the Coliseum (entity-tagged text). These three queriesXMLDATA - CENTRIC XMLare structured queries that cannot be answered well by an unranked retrievalsystem. As we argued in Example 1.1 (page 15) unranked retrieval modelslike the Boolean model suffer from low recall. For instance, an unrankedsystem would return a potentially large number of articles that mention theVatican, the Coliseum and sightseeing tours without ranking the ones thatare most relevant for the query first.
Most users are also notoriously bad atprecisely stating structural constraints. For instance, users may not knowfor which structured elements the search system supports search. In our example, the user may be unsure whether to issue the query as sightseeing AND( COUNTRY:Vatican OR LANDMARK :Coliseum) , as sightseeing AND ( STATE :Vatican ORBUILDING :Coliseum) or in some other form. Users may also be completely unfamiliar with structured search and advanced search interfaces or unwillingto use them.
In this chapter, we look at how ranked retrieval methods can beadapted to structured documents to address these problems.We will only look at one standard for encoding structured documents: Extensible Markup Language or XML, which is currently the most widely usedsuch standard. We will not cover the specifics that distinguish XML fromother types of markup such as HTML and SGML. But most of what we sayin this chapter is applicable to markup languages in general.In the context of information retrieval, we are only interested in XML asa language for encoding text and documents.
A perhaps more widespreaduse of XML is to encode non-text data. For example, we may want to exportdata in XML format from an enterprise resource planning system and thenread them into an analytics program to produce graphs for a presentation.This type of application of XML is called data-centric because numerical andnon-text attribute-value data dominate and text is usually a small fraction ofthe overall data. Most data-centric XML is stored in databases – in contrastto the inverted index-based methods for text-centric XML that we present inthis chapter.Online edition (c) 2009 Cambridge UP10.1 Basic XML conceptsSEMISTRUCTUREDRETRIEVAL10.1XML ELEMENTXML ATTRIBUTEXML DOM197We call XML retrieval structured retrieval in this chapter.
Some researchersprefer the term semistructured retrieval to distinguish XML retrieval from databasequerying. We have adopted the terminology that is widespread in the XMLretrieval community. For instance, the standard way of referring to XMLqueries is structured queries, not semistructured queries. The term structuredretrieval is rarely used for database querying and it always refers to XMLretrieval in this book.There is a second type of information retrieval problem that is intermediatebetween unstructured retrieval and querying a relational database: parametric and zone search, which we discussed in Section 6.1 (page 110). In thedata model of parametric and zone search, there are parametric fields (relational attributes like date or file-size) and zones – text attributes that eachtake a chunk of unstructured text as value, e.g., author and title in Figure 6.1(page 111).
The data model is flat, that is, there is no nesting of attributes.The number of attributes is small. In contrast, XML documents have themore complex tree structure that we see in Figure 10.2 in which attributesare nested. The number of attributes and nodes is greater than in parametricand zone search.After presenting the basic concepts of XML in Section 10.1, this chapterfirst discusses the challenges we face in XML retrieval (Section 10.2).
Next wedescribe a vector space model for XML retrieval (Section 10.3). Section 10.4presents INEX, a shared task evaluation that has been held for a number ofyears and currently is the most important venue for XML retrieval research.We discuss the differences between data-centric and text-centric approachesto XML in Section 10.5.Basic XML conceptsAn XML document is an ordered, labeled tree. Each node of the tree is anXML element and is written with an opening and closing tag. An element canhave one or more XML attributes.
In the XML document in Figure 10.1, thescene element is enclosed by the two tags <scene ...> and </scene>. Ithas an attribute number with value vii and two child elements, title and verse.Figure 10.2 shows Figure 10.1 as a tree. The leaf nodes of the tree consist oftext, e.g., Shakespeare, Macbeth, and Macbeth’s castle. The tree’s internal nodesencode either the structure of the document (title, act, and scene) or metadatafunctions (author).The standard for accessing and processing XML documents is the XMLDocument Object Model or DOM.
The DOM represents elements, attributesand text within elements as nodes in a tree. Figure 10.2 is a simplified DOMrepresentation of the XML document in Figure 10.1.2 With a DOM API, we2. The representation is simplified in a number of respects. For example, we do not show theOnline edition (c) 2009 Cambridge UP19810 XML retrieval<play><author>Shakespeare</author><title>Macbeth</title><act number="I"><scene number="vii"><title>Macbeth’s castle</title><verse>Will I with wine and wassail ...</verse></scene></act></play>◮ Figure 10.1 An XML document.root elementplayelementelementelementauthoracttitletexttextShakespeareMacbethattributeelementnumber="I"sceneattributeelementelementnumber="vii"versetitletexttextWill I with ...Macbeth’s castle◮ Figure 10.2 The XML document in Figure 10.1 as a simplified DOM object.Online edition (c) 2009 Cambridge UP19910.1 Basic XML conceptsarticlesection//article[.//yr = 2001 or .//yr = 2002]//section[about(.,summer holidays)]summerholidays◮ Figure 10.3 An XML query in NEXI format and its partial representation as a tree.XPATHXML CONTEXTSCHEMAXML DTDXML S CHEMAcan process an XML document by starting at the root element and then descending down the tree from parents to children.XPath is a standard for enumerating paths in an XML document collection.We will also refer to paths as XML contexts or simply contexts in this chapter.Only a small subset of XPath is needed for our purposes.
The XPath expression node selects all nodes of that name. Successive elements of a path areseparated by slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that an arbitrary number ofelements can intervene on a path: play//scene selects all scene elementsoccurring in a play element. In Figure 10.2 this set consists of a single scene element, which is accessible via the path play, act, scene from the top.
An initialslash starts the path at the root element. /play/title selects the play’s title in Figure 10.1, /play//title selects a set with two members (the play’stitle and the scene’s title), and /scene/title selects no elements. For notational convenience, we allow the final element of a path to be a vocabularyterm and separate it from the element path by the symbol #, even though thisdoes not conform to the XPath standard.
For example, title#"Macbeth"selects all titles containing the term Macbeth.We also need the concept of schema in this chapter. A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare’s plays may stipulate that scenes canonly occur as children of acts and that only acts and scenes have the number attribute. Two standards for schemas for XML documents are XML DTD(document type definition) and XML Schema. Users can only write structuredqueries for an XML retrieval system if they have some minimal knowledgeabout the schema of the collection.root node and text is not embedded in text nodes.