An introduction to information retrieval. Manning, Raghavan (2009), page 100


Online edition (c) 2009 Cambridge UP

Arguably, we only need to have “high-quality” web pages in the taxonomy, with only the best web pages for each category. However, just discovering these and classifying them accurately and consistently into the taxonomy entails significant human effort. Furthermore, in order for a user to effectively discover web pages classified into the nodes of the taxonomy tree, the user’s idea of what sub-tree(s) to seek for a particular topic should match that of the editors performing the classification.

This quickly becomes challenging as the size of the taxonomy grows; the Yahoo! taxonomy tree surpassed 1000 distinct nodes fairly early on. Given these challenges, the popularity of taxonomies declined over time, even though variants (such as About.com and the Open Directory Project) sprang up with subject-matter experts collecting and annotating web pages for each category.

The first generation of web search engines transported classical search techniques such as those in the preceding chapters to the web domain, focusing on the challenge of scale.

The earliest web search engines had to contend with indexes containing tens of millions of documents, which was a few orders of magnitude larger than any prior information retrieval system in the public domain. Indexing, query serving and ranking at this scale required the harnessing together of tens of machines to create highly available systems, again at scales not witnessed hitherto in a consumer-facing search application. The first generation of web search engines was largely successful at solving these challenges while continually indexing a significant fraction of the Web, all the while serving queries with sub-second response times. However, the quality and relevance of web search results left much to be desired owing to the idiosyncrasies of content creation on the Web that we discuss in Section 19.2. This necessitated the invention of new ranking and spam-fighting techniques in order to ensure the quality of the search results. While classical information retrieval techniques (such as those covered earlier in this book) continue to be necessary for web search, they are not by any means sufficient.

A key aspect (developed further in Chapter 21) is that whereas classical techniques measure the relevance of a document to a query, there remains a need to gauge the authoritativeness of a document based on cues such as which website hosts it.

19.2 Web characteristics

The essential feature that led to the explosive growth of the web (decentralized content publishing with essentially no central control of authorship) turned out to be the biggest challenge for web search engines in their quest to index and retrieve this content. Web page authors created content in dozens of (natural) languages and thousands of dialects, thus demanding many different forms of stemming and other linguistic operations. Because publishing was now open to tens of millions, web pages exhibited heterogeneity at a daunting scale, in many crucial aspects.

First, content creation was no longer the preserve of editorially trained writers; while this represented a tremendous democratization of content creation, it also resulted in a tremendous variation in grammar and style (and in many cases, no recognizable grammar or style). Indeed, web publishing in a sense unleashed the best and worst of desktop publishing on a planetary scale, so that pages quickly became riddled with wild variations in colors, fonts and structure. Some web pages, including the professionally created home pages of some large corporations, consisted entirely of images (which, when clicked, led to richer textual content) and therefore contained no indexable text.

What about the substance of the text in web pages? The democratization of content creation on the web meant a new level of granularity in opinion on virtually any subject.

This meant that the web contained truth, lies, contradictions and suppositions on a grand scale. This gives rise to the question: which web pages does one trust? In a simplistic approach, one might argue that some publishers are trustworthy and others not, which raises the question of how a search engine is to assign such a measure of trust to each website or web page.

In Chapter 21 we will examine approaches to understanding this question. More subtly, there may be no universal, user-independent notion of trust; a web page whose contents are trustworthy to one user may not be so to another. In traditional (non-web) publishing this is not an issue: users self-select sources they find trustworthy. Thus one reader may find the reporting of The New York Times to be reliable, while another may prefer The Wall Street Journal. But when a search engine is the only viable means for a user to become aware of (let alone select) most content, this challenge becomes significant.

While the question “how big is the Web?” has no easy answer (see Section 19.5), the question “how many web pages are in a search engine’s index” is more precise, although even this question has issues. By the end of 1995, Altavista reported that it had crawled and indexed approximately 30 million static web pages.

Static web pages are those whose content does not vary from one request for that page to the next. For this purpose, a professor who manually updates his home page every week is considered to have a static web page, but an airport’s flight status page is considered to be dynamic. Dynamic pages are typically mechanically generated by an application server in response to a query to a database, as shown in Figure 19.1. One sign of such a page is that the URL has the character "?" in it. Since the number of static web pages was believed to be doubling every few months in 1995, early web search engines such as Altavista had to constantly add hardware and bandwidth for crawling and indexing web pages.

◮ Figure 19.1 A dynamically generated web page. The browser sends a request for flight information on flight AA129 to the web application, which fetches the information from back-end databases and then creates a dynamic web page that it returns to the browser.

◮ Figure 19.2 Two nodes of the web graph joined by a link.

19.2.1 The web graph

We can view the static Web consisting of static HTML pages together with the hyperlinks between them as a directed graph in which each web page is a node and each hyperlink a directed edge.

Figure 19.2 shows two nodes A and B from the web graph, each corresponding to a web page, with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the web graph. Figure 19.2 also shows that (as is the case with most links on web pages) there is some text surrounding the origin of the hyperlink on page A.

This text is generally enclosed by the <a> (for anchor) tag whose href attribute encodes the hyperlink in the HTML code of page A, and is referred to as anchor text. As one might suspect, this directed graph is not strongly connected: there are pairs of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks. We refer to the hyperlinks into a page as in-links and those out of a page as out-links. The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15, in a range of studies.

We similarly define the out-degree of a web page to be the number of links out of it.

◮ Figure 19.3 A sample small web graph. In this example we have six pages labeled A-F. Page B has in-degree 3 and out-degree 1. This example graph is not strongly connected: there is no path from any of pages B-F to page A.
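The definitions of in-degree, out-degree and strong connectivity can be checked mechanically on a small graph. The following is a minimal sketch in Python; the edge list is hypothetical and was chosen only so that it matches the properties stated in the caption of Figure 19.3 (six pages, B with in-degree 3 and out-degree 1, no path back to A). It is not the actual set of edges in the figure, and the function names are illustrative.

```python
from collections import Counter, deque

# Hypothetical edges (source page, destination page) for six pages A-F.
# They satisfy the caption's stated properties but are NOT the figure's real edges.
EDGES = [
    ("A", "B"), ("A", "C"),
    ("C", "B"), ("D", "B"),
    ("B", "E"),
    ("E", "D"), ("E", "F"),
]


def degrees(edges):
    """Return (in_degree, out_degree) counters keyed by page."""
    in_degree, out_degree = Counter(), Counter()
    for src, dst in edges:
        out_degree[src] += 1  # one more link out of src
        in_degree[dst] += 1   # one more link into dst
    return in_degree, out_degree


def reachable(edges, start):
    """Return the set of pages reachable from start by following hyperlinks."""
    out_links = {}
    for src, dst in edges:
        out_links.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first search over out-links
        page = queue.popleft()
        for nxt in out_links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_strongly_connected(edges):
    """True if every page can reach every other page."""
    pages = {p for edge in edges for p in edge}
    return all(reachable(edges, p) == pages for p in pages)


in_degree, out_degree = degrees(EDGES)
print("B:", in_degree["B"], "in-links,", out_degree["B"], "out-link")
print("strongly connected:", is_strongly_connected(EDGES))
```

Running one breadth-first search per page is quadratic in the worst case, which is acceptable for a toy graph; at web scale one would use a linear-time strongly connected components algorithm instead.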

The notions of in-degree and out-degree are represented in Figure 19.3.

There is ample evidence that these links are not randomly distributed; for one thing, the distribution of the number of links into a web page does not follow the Poisson distribution one would expect if every web page were to pick the destinations of its links uniformly at random.

Rather, this distribution is widely reported to be a power law, in which the total number of web pages with in-degree i is proportional to 1/i^α; the value of α typically reported by studies is 2.1 (see footnote 1). Furthermore, several studies have suggested that the directed graph connecting web pages has a bowtie shape: there are three major categories of web pages that are sometimes referred to as IN, OUT and SCC. A web surfer can pass from any page in IN to any page in SCC, by following hyperlinks. Likewise, a surfer can pass from any page in SCC to any page in OUT. Finally, the surfer can surf from any page in SCC to any other page in SCC.
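The three reachability statements above suggest one simple way to recover the sets on a small graph, sketched here in Python. It assumes we are handed a seed page already known to lie in SCC: SCC is then the set of pages that the seed can both reach and be reached from, IN is the rest of the pages that can reach the seed, and OUT is the rest of the pages the seed can reach. The edge list, page names and the function `bowtie` are hypothetical, and this is an illustration of the definitions rather than the procedure used in the studies cited.

```python
from collections import deque

# Hypothetical toy graph: i1, i2 feed the core; s1, s2, s3 form a cycle; o1, o2 hang off it.
BOWTIE_EDGES = [
    ("i2", "i1"), ("i1", "s1"),
    ("s1", "s2"), ("s2", "s3"), ("s3", "s1"),
    ("s2", "o1"), ("o1", "o2"),
]


def reachable(adjacency, start):
    """Breadth-first search: all pages reachable from start in the given adjacency map."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for nxt in adjacency.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bowtie(edges, seed):
    """Split pages into SCC, IN and OUT relative to a seed assumed to be in SCC."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)   # follow links as written
        backward.setdefault(dst, set()).add(src)  # follow links in reverse
    downstream = reachable(forward, seed)  # pages the seed can reach
    upstream = reachable(backward, seed)   # pages that can reach the seed
    scc = downstream & upstream
    return {"SCC": scc, "IN": upstream - scc, "OUT": downstream - scc}


parts = bowtie(BOWTIE_EDGES, "s1")
for name in ("IN", "SCC", "OUT"):
    print(name, sorted(parts[name]))
```

Pages that fall in none of the three sets are simply left unclassified by this sketch.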

However, it is not possible to pass from a page in SCC to any page in IN, or from a page in OUT to a page in SCC (or, consequently, IN). Notably, in several studies IN and OUT are roughly equal in size, whereas SCC is somewhat larger; most web pages fall into one of these three sets. The remaining pages form into tubes that are small sets of pages outside SCC that lead directly from IN to OUT, and tendrils that either lead nowhere from IN, or from nowhere to OUT. Figure 19.4 illustrates this structure of the Web.

1. Cf. Zipf’s law of the distribution of words in text in Chapter 5 (page 90), which is a power law with α = 1.

◮ Figure 19.4 The bowtie structure of the Web. Here we show one tube and three tendrils.

19.2.2 Spam

Early in the history of web search, it became clear that web search engines were an important means for connecting advertisers to prospective buyers. A user searching for maui golf real estate is not merely seeking news or entertainment on the subject of housing on golf courses on the island of Maui, but is instead likely to be seeking to purchase such a property.
