An introduction to information retrieval. Manning_ Raghavan (2009) (811397), page 49


Text from the file (page 49)

See http://www.w3.org/DOM/.

Online edition (c) 2009 Cambridge UP

Figure 10.4: Tree representation of XML documents and queries (a document tree d1 and query trees q1 and q2).

A common format for XML queries is NEXI (Narrowed Extended XPath I).

We give an example in Figure 10.3. We display the query on four lines for typographical convenience, but it is intended to be read as one unit without line breaks. In particular, //section is embedded under //article.

The query in Figure 10.3 specifies a search for sections about the summer holidays that are part of articles from 2001 or 2002.

As in XPath, double slashes indicate that an arbitrary number of elements can intervene on a path. The dot in a clause in square brackets refers to the element the clause modifies. The clause [.//yr = 2001 or .//yr = 2002] modifies //article. Thus, the dot refers to //article in this case. Similarly, the dot in [about(., summer holidays)] refers to the section that the clause modifies.

The two yr conditions are relational attribute constraints.

Only articles whose yr attribute is 2001 or 2002 (or that contain an element whose yr attribute is 2001 or 2002) are to be considered. The about clause is a ranking constraint: sections that occur in the right type of article are to be ranked according to how relevant they are to the topic summer holidays.

We usually handle relational attribute constraints by prefiltering or postfiltering: we simply exclude all elements from the result set that do not meet the relational attribute constraints.
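The prefiltering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-article collection is made up, yr is modeled as a child element (matching the .//yr path in the NEXI clause), and the function name prefilter_sections is hypothetical. Ranking the surviving sections by the about clause is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection; element names mirror the NEXI example in the text.
COLLECTION = """
<collection>
  <article><yr>2001</yr>
    <section>summer holidays at the lake</section>
    <section>tax reform</section>
  </article>
  <article><yr>1999</yr>
    <section>summer holidays abroad</section>
  </article>
</collection>
""".strip()

def prefilter_sections(root, years):
    """Return the sections of every article that contains a yr element
    (at any depth, as in .//yr) whose value is one of the allowed years."""
    kept = []
    for article in root.iter("article"):
        if any((yr.text or "").strip() in years for yr in article.iter("yr")):
            kept.extend(article.iter("section"))
    return kept

root = ET.fromstring(COLLECTION)
candidates = prefilter_sections(root, {"2001", "2002"})
print([section.text for section in candidates])
```

Only the sections of the 2001 article survive; the section from 1999 is excluded before any relevance scoring takes place, even though it mentions summer holidays.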

In this chapter, we will not address how to perform this filtering efficiently and instead focus on the core information retrieval problem in XML retrieval, namely how to rank documents according to the relevance criteria expressed in the about conditions of the NEXI query.

If we discard relational attributes, we can represent documents as trees with only one type of node: element nodes. In other words, we remove all attribute nodes from the XML document, such as the number attribute in Figure 10.1. Figure 10.4 shows a subtree of the document in Figure 10.1 as an element-node tree (labeled d1).

We can represent queries as trees in the same way.
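As a minimal sketch of what discarding attribute nodes means in practice, the following Python snippet strips every attribute from a parsed document so that only element nodes and their text remain. The sample play is loosely modeled on the description of Figure 10.1, which is not reproduced here, so its exact content is an assumption; strip_attributes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the document of Figure 10.1.
PLAY = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
""".strip()

def strip_attributes(root):
    """Remove all attribute nodes in place, leaving an element-node tree."""
    for elem in root.iter():
        elem.attrib.clear()
    return root

root = strip_attributes(ET.fromstring(PLAY))
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

After this step the act and scene elements no longer carry their number attribute, and the tree contains a single node type, as in the d1 tree of Figure 10.4.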

Representing queries as trees in this way is a query-by-example approach to query language design because users pose queries by creating objects that satisfy the same formal description as documents. In Figure 10.4, q1 is a search for books whose titles score highly for the keywords Julius Caesar. q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war. [3]

10.2 Challenges in XML retrieval

In this section, we discuss a number of challenges that make structured retrieval more difficult than unstructured retrieval. Recall from page 195 the basic setting we assume in structured retrieval: the collection consists of structured documents and queries are either structured (as in Figure 10.3) or unstructured (e.g., summer holidays).

The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval.

If we query Shakespeare’s plays for Macbeth’s castle, should we return the scene, the act or the entire play in Figure 10.2? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit.

One criterion for selecting the most appropriate part of a document is the structured document retrieval principle:

Structured document retrieval principle.

A system should always retrieve the most specific part of a document answering the query.

This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. However, it can be hard to implement this principle algorithmically.

Consider the query title#"Macbeth" applied to Figure 10.2. The title of the tragedy, Macbeth, and the title of Act I, Scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred. Deciding which level of the tree is right for answering a query is difficult.

Parallel to the issue of which parts of a document to return to the user is the issue of which parts of a document to index. In Section 2.1.2 (page 20), we discussed the need for a document unit or indexing unit in indexing and retrieval.

[3] To represent the semantics of NEXI queries fully we would also need to designate one node in the tree as a “target node”, for example, the section in the tree in Figure 10.3. Without the designation of a target node, the tree in Figure 10.3 is not a search for sections embedded in articles (as specified by NEXI), but a search for articles that contain sections.

Figure 10.5: Partitioning an XML document into non-overlapping indexing units.

In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web, etc.

In structured retrieval, there are a number of different approaches to defining the indexing unit.

One approach is to group nodes into non-overlapping pseudodocuments as shown in Figure 10.5. In the example, books, chapters and sections have been designated to be indexing units, but without overlap. For example, the leftmost dashed indexing unit contains only those parts of the tree dominated by book that are not already part of other indexing units. The disadvantage of this approach is that pseudodocuments may not make sense to the user because they are not coherent units.
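The partitioning into non-overlapping pseudodocuments can be illustrated with a short Python sketch. It assumes, as in the text's example, that book, chapter and section are the indexing units; the sample document, the function name partition, and the rule that each text fragment goes to its nearest enclosing unit are illustrative assumptions rather than a prescribed algorithm.

```python
import xml.etree.ElementTree as ET

UNIT_TAGS = {"book", "chapter", "section"}  # indexing units, as in the text

# Hypothetical document in the spirit of Figure 10.5.
BOOK = (
    "<book><class>fiction</class><author>A. Writer</author><title>A Title</title>"
    "<chapter><title>One</title><section>first section text</section></chapter>"
    "</book>"
)

def partition(root, unit_tags=UNIT_TAGS):
    """Assign each text fragment to its nearest enclosing indexing unit.

    Returns a list of (unit tag, [text fragments]) in document order."""
    if root.tag not in unit_tags:
        raise ValueError("the root element must itself be an indexing unit")
    units = []

    def visit(elem, current):
        if elem.tag in unit_tags:
            current = (elem.tag, [])   # start a new pseudodocument
            units.append(current)
        if elem.text and elem.text.strip():
            current[1].append(elem.text.strip())
        for child in elem:
            visit(child, current)
            # Text following a child belongs to the unit enclosing the parent.
            if child.tail and child.tail.strip():
                current[1].append(child.tail.strip())

    visit(root, None)
    return units

units = partition(ET.fromstring(BOOK))
for tag, fragments in units:
    print(tag, fragments)
```

In this sketch the book pseudodocument ends up holding only the class, author and title text, because the chapter and section content has been claimed by their own units.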

For instance, the leftmost indexing unit in Figure 10.5 merges three disparate elements, the class, author and title elements.

We can also use one of the largest elements as the indexing unit, for example, the book element in a collection of books or the play element for Shakespeare’s works. We can then postprocess search results to find for each book or play the subelement that is the best hit. For example, the query Macbeth’s castle may return the play Macbeth, which we can then postprocess to identify act I, scene vii as the best-matching subelement. Unfortunately, this two-stage retrieval process fails to return the best subelement for many queries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it.

Instead of retrieving large units and identifying subelements (top down), we can also search all leaves, select the most relevant ones and then extend them to larger units in postprocessing (bottom up).

For the query Macbeth’s castle in Figure 10.1, we would retrieve the title Macbeth’s castle in the first pass and then decide in a postprocessing step whether to return the title, the scene, the act or the play. This approach has a problem similar to the last one: the relevance of a leaf element is often not a good predictor of the relevance of elements it is contained in.

The least restrictive approach is to index all elements. This is also problematic. Many XML elements are not meaningful search results, e.g., typographical elements like <b>definitely</b> or an ISBN number, which cannot be interpreted without context. Also, indexing all elements means that search results will be highly redundant. For the query Macbeth’s castle and the document in Figure 10.1, we would return all of the play, act, scene and title elements on the path between the root node and Macbeth’s castle. The leaf node would then occur four times in the result set, once directly and three times as part of other elements.
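The redundancy that results from indexing every element can be made concrete with a small Python sketch. It treats each element as its own retrievable unit and reports every element whose text content contains the query phrase. The sample play is an assumed stand-in for Figure 10.1, exact phrase matching is a simplification of real ranking, and elements_containing is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the document of Figure 10.1.
PLAY = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
""".strip()

def elements_containing(root, phrase):
    """Tags of all elements whose full text content contains the phrase."""
    hits = []
    for elem in root.iter():
        text = " ".join(elem.itertext())  # text of elem and all descendants
        if phrase in text:
            hits.append(elem.tag)
    return hits

root = ET.fromstring(PLAY)
hits = elements_containing(root, "Macbeth's castle")
print(hits)  # ['play', 'act', 'scene', 'title']
```

The same leaf text is returned four times: once as the scene title itself and three more times inside the enclosing scene, act and play.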

We call elements that are contained within each other nested. Returning redundant nested elements in a list of returned hits is not very user-friendly.

Because of the redundancy caused by nested elements it is common to restrict the set of elements that are eligible to be returned. Restriction strategies include:

• discard all small elements
• discard all element types that users do not look at (this requires a working XML retrieval system that logs this information)
• discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available)
• only keep element types that a system designer or librarian has deemed to be useful search results

In most of these approaches, result sets will still contain nested elements. Thus, we may want to remove some elements in a postprocessing step to reduce redundancy.

Alternatively, we can collapse several nested elements in the results list and use highlighting of query terms to draw the user’s attention to the relevant passages. If query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section and the paragraph both occur in the results list, it is sufficient to show the section. An additional advantage of this approach is that the paragraph is presented together with its context (i.e., the embedding section). This context may be helpful in interpreting the paragraph (e.g., the source of the information reported) even if the paragraph on its own satisfies the query.

If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated as few nested elements have the same type.

But as we discussed in the introduction, users often don’t know what the name of an element in the collection is (Is the Vatican a country or a city?) or they may not know how to compose structured queries at all.

Figure 10.6: Schema heterogeneity: intervening nodes and mismatched names (query trees q3 and q4, document trees d2 and d3).

A challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idf) statistics as defined in Section 6.2.1 (page 117). For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate.

It makes little sense to compute a single document frequency for Gates in this example.

One solution is to compute idf for XML-context/term pairs, e.g., to compute different idf weights for author#"Gates" and section#"Gates". Unfortunately, this scheme will run into sparse data problems: many XML-context pairs occur too rarely to reliably estimate df (see Section 13.2, page 260, for a discussion of sparseness). A compromise is to consider only the parent node x of the term, and not the rest of the path from the root to x, to distinguish contexts. There are still conflations of contexts that are harmful in this scheme.
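The compromise of using only the parent node as context can be sketched in Python as follows. The four-document collection is invented for illustration, idf is computed as log10(N/df) in the usual way, and the function names are hypothetical. For simplicity the sketch ignores text that follows a child element (tail text) and tokenizes by whitespace only.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical collection: "gates" occurs both as a surname and as a plural noun.
DOCS = [
    "<book><author>Bill Gates</author><section>the road ahead</section></book>",
    "<book><author>B. Gates</author><section>garden gates and fences</section></book>",
    "<book><author>Jane Doe</author><section>iron gates</section></book>",
    "<book><author>John Roe</author><section>city gates</section></book>",
]

def context_term_pairs(root):
    """Set of (parent tag, lowercased term) pairs occurring in one document."""
    pairs = set()
    for elem in root.iter():
        for term in (elem.text or "").lower().split():
            pairs.add((elem.tag, term))
    return pairs

def context_idf(docs):
    """idf per (parent tag, term) pair: log10(N / df), where df counts documents."""
    df = defaultdict(int)
    for xml in docs:
        for pair in context_term_pairs(ET.fromstring(xml)):
            df[pair] += 1
    n = len(docs)
    return {pair: math.log10(n / count) for pair, count in df.items()}

idf = context_idf(DOCS)
print(idf[("author", "gates")])   # df = 2 of 4 documents
print(idf[("section", "gates")])  # df = 3 of 4 documents
```

Here author#"gates" and section#"gates" receive different weights because their document frequencies differ, whereas a single context-free count would lump the four documents together.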

