An Adaptive Crawler for Locating Hidden-Web Entry Points (2007), page 2

It leaves a site after retrieving a pre-defined number of distinct forms, or after it visits a pre-defined number of pages in the site.

• We extend the crawling process with a new module that accurately determines the relevance of retrieved forms with respect to a particular database domain. The notion of relevance of a form is user-defined. This component is essential for the effectiveness of online learning, and it greatly improves the quality of the set of forms retrieved by the crawler.

These components and their implementation are described in [3].
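The per-site stopping criteria above can be sketched as a small policy object. The class name, the concrete thresholds, and the method names below are illustrative assumptions, not taken from the paper:

```python
class SiteVisitPolicy:
    """Tracks one site's crawl budget: leave after enough distinct
    forms are found, or after enough pages are visited (limits are
    illustrative, not the paper's values)."""

    def __init__(self, max_forms=3, max_pages=100):
        self.max_forms = max_forms      # pre-defined distinct-form limit
        self.max_pages = max_pages      # pre-defined page-visit limit
        self.forms_seen = set()         # distinct forms found on this site
        self.pages_visited = 0

    def record_page(self, form_ids):
        """Register one visited page and any form identifiers found on it."""
        self.pages_visited += 1
        self.forms_seen.update(form_ids)

    def should_leave(self):
        """Leave the site once either pre-defined limit is reached."""
        return (len(self.forms_seen) >= self.max_forms
                or self.pages_visited >= self.max_pages)

policy = SiteVisitPolicy(max_forms=2, max_pages=5)
policy.record_page(["form-a"])
policy.record_page(["form-a", "form-b"])  # second distinct form found
print(policy.should_leave())  # True: distinct-form limit reached
```

Counting distinct forms (a set) rather than raw form sightings matches the "distinct forms" wording of the stopping rule.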

Below we discuss the aspects of the link classifier and frontier manager needed to understand the adaptive learning mechanism of ACHE.

We have performed an extensive performance evaluation of our crawling framework over real Web data in eight representative domains. This evaluation shows that the ACHE learning strategy is effective: the crawlers are able to adapt and significantly improve their harvest rates as the crawl progresses. Even starting from scratch (without a link classifier), ACHE is able to obtain harvest rates that are comparable to those of crawlers like the FFC that are constructed using prior knowledge.
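Harvest rate is used here in the usual focused-crawling sense: the fraction of retrieved items that are relevant. A minimal sketch (the function and the sample numbers are illustrative, not the paper's measurements):

```python
def harvest_rate(relevant_retrieved, total_pages_crawled):
    """Focused-crawling metric: relevant results per page fetched.
    A crawler that adapts well should see this ratio rise over the crawl."""
    if total_pages_crawled == 0:
        return 0.0
    return relevant_retrieved / total_pages_crawled

# e.g., 130 relevant forms found over 10,000 crawled pages (made-up numbers)
print(harvest_rate(130, 10_000))  # 0.013
```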

The results also show that ACHE is effective and obtains harvest rates that are substantially higher than a crawler whose focus is only on page content; these differences are even more pronounced when only relevant forms (i.e., forms that belong to the target database domain) are considered.

Finally, the results also indicate that the automated feature selection is able to identify good features, which for some domains were more effective than features identified manually.

The remainder of the paper is organized as follows. Since ACHE extends the focus strategy of the FFC, to make the paper self-contained, in Section 2 we give a brief overview of the FFC.

2.1 Link Classifier

Since forms are sparsely distributed on the Web, by prioritizing only links that bring immediate return, i.e., links whose patterns are similar to those that point to pages containing searchable forms, the crawler may miss target pages that can only be reached with additional steps.

The link classifier aims to also identify links that have delayed benefit and belong to paths that will eventually lead to pages that contain forms. It learns to estimate the distance (the length of the path) between a link and a target page based on link patterns: given a link, the link classifier assigns a score to the link which corresponds to the distance between the link and a page that contains a relevant form.

In the FFC, the link classifier is built as follows.
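The scoring behavior just described can be sketched as a multiclass text classifier whose class i means "this link is an estimated i steps from a form page". The naive Bayes model and the URL/anchor-text features below are illustrative assumptions, not necessarily the paper's exact choices:

```python
# Sketch of a distance-predicting link classifier: class i means the link
# is estimated to be i steps from a page with a searchable form.
# Feature choice (anchor text + URL tokens) is an assumption.
import math
import re
from collections import Counter, defaultdict

class NaiveBayesLinkClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # level -> token counts
        self.level_counts = Counter()            # level -> #training links
        self.vocab = set()

    @staticmethod
    def tokenize(url, anchor_text):
        return re.findall(r"[a-z]+", (url + " " + anchor_text).lower())

    def train(self, examples):
        # examples: iterable of (url, anchor_text, level)
        for url, anchor, level in examples:
            toks = self.tokenize(url, anchor)
            self.word_counts[level].update(toks)
            self.level_counts[level] += 1
            self.vocab.update(toks)

    def predict(self, url, anchor_text):
        """Return the most likely level (distance) for a new link."""
        toks = self.tokenize(url, anchor_text)
        total = sum(self.level_counts.values())
        best_level, best_lp = None, float("-inf")
        for level in self.level_counts:
            lp = math.log(self.level_counts[level] / total)
            denom = sum(self.word_counts[level].values()) + len(self.vocab)
            for t in toks:  # Laplace-smoothed token likelihoods
                lp += math.log((self.word_counts[level][t] + 1) / denom)
            if lp > best_lp:
                best_level, best_lp = level, lp
        return best_level  # estimated distance to a form page

clf = NaiveBayesLinkClassifier()
clf.train([
    ("http://cars.example/search", "advanced search", 1),
    ("http://cars.example/used-cars", "browse inventory", 2),
])
print(clf.predict("http://autos.example/search-form", "search"))  # 1
```

In a real deployment the predicted level would be used as the link's priority, with links predicted closer to form pages crawled first.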

Given a set of URLs of pages that contain forms in a given database domain, paths to these pages are obtained by crawling backwards from these pages, using the link: facility provided by search engines such as Google and Yahoo! [6]. The backward crawl proceeds in a breadth-first manner. Each level l+1 is constructed by retrieving all documents that point to documents in level l. From the set of paths gathered, we manually select the best features. Using these data, the classifier is trained to estimate the distance between a given link and a target page that contains a searchable form. Intuitively, a link that matches the features of level 1 is likely to point to a page that contains a form; and a link that matches the features of level l is likely l steps away from a page that contains a form.

1 In this paper, we use the terms 'online database' and 'hidden-Web source' interchangeably.

WWW 2007 / Track: Search, Session: Crawlers

[Figure 1: Architecture of ACHE. The new modules that are responsible for the online focus adaptation are shown in blue; the modules shown in white are used both in the FFC and in ACHE. Components: Crawler, Frontier Manager, Page Classifier, Searchable Form Classifier, Domain-Specific Form Classifier, Form Filtering, Form Database, Adaptive Link Learner, Link Classifier, Feature Selection.]

2.2 Frontier Manager

The goal of the frontier manager is to maximize the expected reward for the crawler. Each link in the frontier is represented by a tuple (link, Q), where Q reflects the expected reward for link:

    Q(state, link) = reward    (1)

Q maps a state (the current crawling frontier) and a link link to the expected reward for following link. The value of Q is approximated by discretization and is determined by: (1) the distance between link and the target pages (links that are closer to the target pages have a higher Q value and are placed in the highest-priority queues); and (2) the likelihood of link belonging to a given level.

The frontier manager is implemented as a set of N queues, where each queue corresponds to a link classifier level: a link l is placed on queue i if the link classifier estimates that l is i steps from a target page. Within a queue, links are ordered based on their likelihood of belonging to the level associated with the queue.

Although the goal of the frontier manager is to maximize the expected reward, if it only chooses links that give the best expected rewards, it may forgo links that are sub-optimal but that lead to high rewards in the future. To ensure that links with delayed benefit are also selected, the crawling frontier is updated in batches. When the crawler starts, all seeds are placed in queue 1. At each step, the crawler selects the link with the highest relevance score from the first non-empty queue. If the page it downloads belongs to the target topic, its links are classified by the link classifier and added to a separate persistent frontier. Only when the queues in the crawling frontier become empty does the crawler load the queues from the persistent frontier.

2.3 Limitations of the FFC

An experimental evaluation of the FFC [3] showed that the FFC is more efficient and retrieves up to an order of magnitude more searchable forms than a crawler that focuses only on topic. In addition, FFC configurations with a link classifier that uses multiple levels perform uniformly better than their counterparts with a single level (i.e., a crawler that focuses only on immediate benefit). The improvements in harvest rate for the multi-level configurations varied between 20% and 110% for the three domains we considered. This confirms results obtained in other works, which underline the importance of taking delayed benefit into account for sparse concepts [11, 22].

The strategy used by the FFC has two important limitations. First, the set of forms retrieved by the FFC is highly heterogeneous. Although the Searchable Form Classifier is able to filter out non-searchable forms with high accuracy, a qualitative analysis of the searchable forms retrieved by the FFC showed that the set contains forms that belong to many different database domains. The average percentage of relevant forms (i.e., forms that belong to the target domain) in the set was low: around 16%. For some domains the percentage was as low as 6.5%. Whereas it is desirable to list only relevant forms in online database directories, such as BrightPlanet [7] and the Molecular Biology Database Collection [13], for some applications this is a requirement. Having a homogeneous set of forms that belong to the same database domain is critical for techniques such as statistical schema matching across Web interfaces [16], whose effectiveness can be greatly diminished if the set of input forms is noisy and contains forms from multiple domains.

Another limitation of the FFC is that tuning the crawler and training the link classifier can be time-consuming. The process used to select the link classifier features is manual: terms deemed representative are manually selected for each level. The quality of these terms is highly dependent on knowledge of the domain and on whether the set of paths obtained in the back-crawl is representative of a wider segment of the Web for that database domain. If the link classifier is not built with a representative set of paths for a given database domain, because the FFC uses a fixed focus strategy, the crawler will be confined to a possibly small subset of the promising links in the domain.

3. DYNAMICALLY ADAPTING THE CRAWLER FOCUS

With the goal of further improving crawler efficiency, the quality of its results, and automating the process of crawler setup and tuning, we use a learning-agent-based approach to the problem of locating hidden-Web entry points. Learning agents have four components [23]:

• The behavior generating element (BGE), which, based on the current state, selects an action that tries to maximize the expected reward taking into account its goals (exploitation);
• The problem generator (PG), which is responsible for suggesting actions that will lead to new experiences, even if the benefit is not immediate, i.e., the decision is locally suboptimal (exploration);
• The critic, which gives the online learning element feedback on the success (or failure) of its actions; and
• The online learning element, which takes the critic's feedback into account to update the policy used by the BGE.

A learning agent must be able to learn from new experiences and, at the same time, it should be robust with respect to biases that may be present in these experiences [20, 23]. An agent's ability to learn and adapt relies on the successful interaction among its components (see Figure 2). Without exploration, an agent may not be able to correct biases introduced during its execution. If the BGE is ineffective, the agent is not able to exploit its acquired knowledge. Finally, a high-quality critic is crucial to prevent the agent from drifting away from its objective. As we discuss below, ACHE combines these four elements to obtain all the advantages of using a learning agent.

[Figure 2: Highlight of the main components involved in the adaptive aspect of a learning agent. Components: BGE, PG, Critic, Online Learning; edges labeled with known, unknown, and successful actions.]

The policy used by the frontier manager is set by the link classifier. In ACHE, we employ the adaptive link learner as the learning element. It dynamically learns features, automatically extracted from successful paths by the feature selection component, and updates the link classifier. The effectiveness of the adaptive link learner depends on the accuracy of the form-filtering process; on the ability of the feature selector to identify 'good' features; and on the efficacy of the frontier manager in balancing exploration and exploitation. Below we describe the components and algorithms responsible for making ACHE adaptive.

3.2 Adaptive Link Learner

In the FFC, link patterns are learned offline. As described in Section 2.1, these patterns are obtained from paths derived by crawling backwards from a set of pages that contain relevant forms. The adaptive link learner, in contrast, uses features of paths that are gathered during the crawl.

[Algorithm 1 Adaptive Link Learner
1: if learningThresholdReached then
2:   paths = collectPaths(relevantForms, length)
...]
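The learning step can be sketched as follows. Only the threshold test (learningThresholdReached) and the collectPaths call are attested in Algorithm 1 as extracted; the back-walk over crawl parents, the token-frequency feature selection, and all names below are illustrative assumptions filled in from the prose:

```python
# Hedged sketch of one adaptive-link-learner cycle: instead of an offline
# back-crawl, harvest the paths that actually led to relevant forms during
# the crawl, then derive new link-classifier features from them.
from collections import Counter

def collect_paths(relevant_forms, parent_of, length):
    """For each page with a relevant form, walk back up to `length`
    links through the pages the crawler actually followed."""
    paths = []
    for page in relevant_forms:
        path, cur = [], page
        for _ in range(length):
            cur = parent_of.get(cur)
            if cur is None:
                break
            path.append(cur)
        paths.append(path)
    return paths

def adaptive_link_learner(relevant_forms, parent_of, length=2, threshold=2):
    """One learning step: once enough relevant forms have been found
    (the learningThresholdReached test), collect successful paths and
    pick the most frequent URL tokens as new link-classifier features."""
    if len(relevant_forms) < threshold:
        return None                        # threshold not reached yet
    paths = collect_paths(relevant_forms, parent_of, length)
    tokens = Counter(tok for path in paths for url in path
                     for tok in url.strip("/").split("/"))
    return [tok for tok, _ in tokens.most_common(3)]

# Toy crawl trace: two relevant form pages reached via /search hub pages.
parent_of = {
    "site-a/form": "site-a/search",
    "site-a/search": "site-a",
    "site-b/form": "site-b/search",
    "site-b/search": "site-b",
}
features = adaptive_link_learner(["site-a/form", "site-b/form"], parent_of)
print(features)
```

Because the paths come from the crawl itself, the link classifier's policy is refreshed with evidence from the pages actually being explored, which is what lets ACHE adapt its focus online.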
