An Adaptive Crawler for Locating Hidden-Web Entry Points (2007)


WWW 2007 / Track: Search, Session: Crawlers

Luciano Barbosa, University of Utah, lbarbosa@cs.utah.edu
Juliana Freire, University of Utah, juliana@cs.utah.edu

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

ABSTRACT

In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process.

General Terms
Algorithms, Design, Experimentation.

Keywords
Hidden Web, Web crawling strategies, online learning, learning classifiers.

1. INTRODUCTION

The hidden Web has been growing at a very fast pace. It is estimated that there are several million hidden-Web sites [18]. These are sites whose contents typically reside in databases and are only exposed on demand, as users fill out and submit forms. As the volume of hidden information grows, there has been increased interest in techniques that allow users and applications to leverage this information. Examples of applications that attempt to make hidden-Web information more easily accessible include: metasearchers [14, 15, 26, 28], hidden-Web crawlers [2, 21], online-database directories [7, 13] and Web information integration systems [10, 17, 25]. Since for any given domain of interest, there are many hidden-Web sources whose data need to be integrated or searched, a key requirement for these applications is the ability to locate these sources. But doing so at a large scale is a challenging problem.

Given the dynamic nature of the Web, with new sources constantly being added and old sources removed and modified, it is important to automatically discover the searchable forms that serve as entry points to the hidden-Web databases. But searchable forms are very sparsely distributed over the Web, even within narrow domains. For example, a topic-focused best-first crawler [9] retrieves only 94 Movie search forms after crawling 100,000 pages related to movies. Thus, to efficiently maintain an up-to-date collection of hidden-Web sources, a crawling strategy must perform a broad search and simultaneously avoid visiting large unproductive regions of the Web.

The crawler must also produce high-quality results. Having a homogeneous set of forms that lead to databases in the same domain is useful, and sometimes required, for a number of applications. For example, the effectiveness of form integration techniques [16, 25] can be greatly diminished if the set of input forms is noisy and contains forms that are not in the integration domain. However, an automated crawling process invariably retrieves a diverse set of forms. A focus topic may encompass pages that contain searchable forms from many different database domains. For example, while crawling to find Airfare search interfaces a crawler is likely to retrieve a large number of forms in different domains, such as Rental Cars and Hotels, since these are often co-located with Airfare search interfaces in travel sites. The set of retrieved forms also includes many non-searchable forms that do not represent database queries such as forms for login, mailing list subscriptions, quote requests, and Web-based email forms.

The Form-Focused Crawler (FFC) [3] was our first attempt to address the problem of automatically locating online databases. The FFC combines techniques for focusing the crawl on a topic with a link classifier which identifies and prioritizes links that are likely to lead to searchable forms in one or more steps. Our preliminary results showed that the FFC is up to an order of magnitude more efficient, with respect to the number of searchable forms it retrieves, than a crawler that focuses the search on topic only. This approach, however, has important limitations. First, it requires substantial manual tuning, including the selection of appropriate features and the creation of the link classifier. In addition, the results obtained are highly dependent on the quality of the set of forms used as the training for the link classifier. If this set is not representative, the crawler may drift away from its target and obtain low harvest rates. Given the size of the Web, and the wide variation in the hyperlink structure, manually selecting a set of forms that cover a representative set of link patterns can be challenging. Last, but not least, the set of forms retrieved by the FFC is very heterogeneous: it includes all searchable forms found during the crawl, and these forms may belong to distinct database domains. For a set of representative database domains, on average, only 16% of the forms retrieved by the FFC are actually relevant. For example, in a crawl to locate airfare search forms, the FFC found 12,893 searchable forms, but among these, only 840 were airfare search forms.

In this paper, we present ACHE (Adaptive Crawler for Hidden-Web Entries), a new framework that addresses these limitations. Given a set of Web forms that are entry points to online databases, ACHE aims to efficiently and automatically locate other forms in the same domain. Our main contributions are:

• We frame the problem of searching for forms in a given database domain as a learning task, and present a new framework whereby crawlers adapt to their environments and automatically improve their behavior by learning from previous experiences. We propose and evaluate two crawling strategies: a completely automated online search, where a crawler builds a link classifier from scratch; and a strategy that combines offline and online learning.

• We propose a new algorithm that selects discriminating features of links and uses these features to automatically construct a link classifier.

[...] the FFC and discuss its limitations. In Section 3, we present the adaptive-learning framework of ACHE and describe the underlying algorithms. Our experimental evaluation is discussed in Section 4. We compare our approach to related work in Section 5 and conclude in Section 6, where we outline directions for future work.

2. BACKGROUND: THE FORM-FOCUSED CRAWLER

The FFC is trained to efficiently locate forms that serve as the entry points to online databases: it focuses its search by taking into account both the contents of pages and patterns in and around the hyperlinks in paths to a Web page. The main components of the FFC are shown in white in Figure 1 and are briefly described below.

• The page classifier is trained to classify pages as belonging to topics in a taxonomy (e.g., arts, movies, jobs in Dmoz). It uses the same strategy as the best-first crawler of [9]. Once the crawler retrieves a page P, if P is classified as being on-topic, its forms and links are extracted.

• The link classifier is trained to identify links that are likely to lead to pages that contain searchable form interfaces in one or more steps. It examines links extracted from on-topic pages and adds the links to the crawling frontier in the order of their predicted reward.

• The frontier manager maintains a set of priority queues with links that are yet to be visited. At each crawling step, it selects the link with the highest priority.

• The searchable form classifier filters out non-searchable forms and ensures only searchable forms are added to the Form Database. This classifier is domain-independent and able to identify searchable forms with high accuracy. The crawler also employs stopping criteria to deal with the fact that sites, in general, contain few searchable forms.
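The interplay of the four components described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Frontier` class, the `crawl` function, the idea of keying each queue by a predicted "level" (estimated number of steps to a form), and all callable parameters (`fetch`, `is_on_topic`, `extract_links`, `score_link`, `find_searchable_forms`) are hypothetical names introduced here for illustration. Classifier training, per-site stopping criteria, and network access are deliberately left out.

```python
import heapq
import itertools
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


class Frontier:
    """Toy frontier manager: a set of priority queues of unvisited links.

    Assumption: queues are keyed by a hypothetical 'level' (predicted number
    of steps to a searchable form); the text only says "a set of priority queues".
    """

    def __init__(self) -> None:
        # level -> heap of (-score, insertion_order, url). heapq is a min-heap,
        # so scores are negated to pop the highest-scoring link first.
        self._queues: Dict[int, List[Tuple[float, int, str]]] = {}
        self._seen: Set[str] = set()
        self._order = itertools.count()

    def add(self, url: str, level: int, score: float) -> bool:
        """Queue a link once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        entry = (-score, next(self._order), url)
        heapq.heappush(self._queues.setdefault(level, []), entry)
        return True

    def next_link(self) -> Optional[str]:
        """Return the link with the highest priority across all queues."""
        heads = [(queue[0], level) for level, queue in self._queues.items() if queue]
        if not heads:
            return None
        _, level = min(heads)  # smallest negated score == highest score
        return heapq.heappop(self._queues[level])[2]


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[object]],
    is_on_topic: Callable[[object], bool],              # stands in for the page classifier
    extract_links: Callable[[object], Iterable[str]],
    score_link: Callable[[str], Tuple[int, float]],     # stands in for the link classifier
    find_searchable_forms: Callable[[object], Iterable[str]],  # searchable form classifier
    max_pages: int = 100,
) -> List[str]:
    """Run a form-focused crawl loop and return the collected searchable forms."""
    frontier = Frontier()
    for url in seeds:
        frontier.add(url, level=0, score=1.0)

    form_database: List[str] = []
    visited = 0
    while visited < max_pages:
        url = frontier.next_link()
        if url is None:
            break
        page = fetch(url)
        visited += 1
        # Forms and links are only extracted from on-topic pages.
        if page is None or not is_on_topic(page):
            continue
        form_database.extend(find_searchable_forms(page))
        for link in extract_links(page):
            level, score = score_link(link)
            frontier.add(link, level, score)
    return form_database
```

Passing the classifiers in as plain callables keeps the loop independent of how they are trained, so in this sketch a fixed link scorer could be swapped for one that is updated during the crawl without touching the frontier logic.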


