An Adaptive Crawler for Locating Hidden-Web Entry Points (2007), page 4

WWW 2007 / Track: Search, Session: Crawlers

Experiments conducted by Yang and Pedersen [27] show that DF obtains results comparable to task-sensitive feature selection approaches, such as information gain [20] and Chi-square [12].

AFS is very simple to implement, and as our experimental results show, it is very effective in practice.

3.4 Form Filtering

The form filtering component acts as a critic and is responsible for identifying relevant forms gathered by ACHE. It assists ACHE in obtaining high-quality results and it also [...]

Footnote 4: Document frequency represents the number of documents in a collection where a given term occurs.

4. EXPERIMENTS

We have performed an extensive performance evaluation of our crawling framework over real Web data in eight representative domains. Besides analyzing the overall performance of our approach, our goals included: evaluating the effectiveness of ACHE in obtaining high-quality results (i.e., in retrieving relevant forms); assessing the quality of the features automatically selected by AFS; and assessing the effectiveness of online learning in the crawling process.

Domain    Description               Density   Norm. Density
Airfare   airfare search            0.132%     1.404
Auto      used cars                 0.962%    10.234
Book      books search              0.142%     1.510
Hotel     hotel availability        1.857%    19.755
Job       job search                0.571%     6.074
Movie     movie titles and DVDs     0.094%     1.000
Music     music CDs                 0.297%     3.159
Rental    car rental availability   0.148%     1.574

Table 1: Database domains used in experiments and density of forms in these domains. The column labeled Norm. Density shows the density values normalized with respect to the lowest density value (for the Movie domain).

[Figure 3: Number of relevant forms returned by the different crawler configurations. Vertical axis: forms retrieved.]

4.1 Experimental Setup

Database Domains. We evaluated our approach over the eight online database domains described in Table 1. This table also shows the density of relevant forms in the domains. Here, we measure density as the number of distinct relevant forms retrieved by a topic-focused crawler (the baseline crawler described below) divided by the total number of pages crawled. Note that not only are forms very sparsely distributed in these domains, but also that there is a large variation in density across domains. In the least dense domain (Movie), only 94 forms are found after the baseline crawler visits 100,000 pages, whereas in the densest domain (Hotel), the same crawler finds 19 times as many forms (1857 forms).

Crawling Strategies. To evaluate the benefit of online learning in ACHE, we ran the following crawler configurations:

• Baseline, a variation of the best-first crawler [9]. The page classifier guides the search and the crawler follows all links that belong to a page whose contents are classified as being on-topic. One difference between Baseline and the best-first crawler is that the former uses the same stopping criteria as the FFC (see footnote 5);
• Offline Learning, the crawler operates using a fixed policy that remains unchanged during the crawling process; this is the same strategy used by the FFC [3];
• Offline-Online Learning, ACHE starts with a pre-defined policy, and this policy is dynamically updated as the crawl progresses;
• Online Learning, ACHE starts using the baseline strategy and builds its policy dynamically, as pages are crawled.

All configurations were run over one hundred thousand pages, and the link classifiers were configured with three levels.

Effectiveness measure. Since our goal is to find searchable forms that serve as entry points to a given database domain, it is important to measure the harvest rate of the crawlers based on the number of relevant forms retrieved per page crawled. It is worth noting that the harvest rates reported in [3] for the FFC (offline learning) took into account all searchable forms retrieved, a superset of the relevant forms. Below, as a point of comparison, we also show the harvest rates for the different crawlers taking all searchable forms into account.

Footnote 5: In earlier experiments, we observed that without the appropriate stopping criteria, the best-first crawler gets trapped in some sites, leading to extremely low harvest rates [3].

4.2 Focus Adaptation and Crawler Efficiency

Figure 3 gives, for each domain, the number of relevant forms retrieved by the four crawler configurations. Online learning leads to substantial improvements in harvest rates when applied to both the Baseline and Offline configurations. The gains vary from 34% to 585% for Online over Baseline, and from 4% to 245% for Offline-Online over Offline.
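The harvest rates and percentage gains quoted here reduce to two simple ratios. The sketch below is an illustration only: the function names `harvest_rate` and `relative_gain` are not from the paper, gains are assumed to mean percentage increases, and the only counts used are the Baseline figures stated in the text for Movie (94 forms) and Hotel (1857 forms) over 100,000 pages.

```python
def harvest_rate(relevant_forms: int, pages_crawled: int) -> float:
    """Relevant forms retrieved per page crawled."""
    if pages_crawled <= 0:
        raise ValueError("pages_crawled must be positive")
    return relevant_forms / pages_crawled


def relative_gain(new: float, base: float) -> float:
    """Fractional improvement of `new` over `base` (0.34 means 34% more)."""
    if base <= 0:
        raise ValueError("base must be positive")
    return (new - base) / base


PAGES = 100_000                      # crawl size used for every configuration
movie = harvest_rate(94, PAGES)      # Baseline crawler, Movie domain
hotel = harvest_rate(1857, PAGES)    # Baseline crawler, Hotel domain

print(f"Movie density: {movie:.3%}")
print(f"Hotel density: {hotel:.3%}")
print(f"Hotel / Movie: {hotel / movie:.3f}")
print(f"Hotel over Movie: {relative_gain(hotel, movie):.0%} more forms")
```

Because density is defined as the Baseline harvest rate, the Hotel to Movie ratio computed here agrees with the Hotel entry in the Norm. Density column of Table 1. Read the same way, a gain of 585% corresponds to `relative_gain(...) == 5.85`, that is, 6.85 times as many relevant forms as the reference configuration.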

These results show that the adaptive learning component of ACHE is able to automatically improve its focus based on the feedback provided by the form filtering component. In addition, Online is able to obtain substantial improvements over Baseline in a completely automated fashion, requiring no initial link classifier and greatly reducing the effort to configure the crawler.

The only exception is the Movie domain. For Movie, the most sparse domain we considered, the Online configuration was not able to learn patterns with enough support from the 94 forms encountered by Baseline.

Effect of Prior Knowledge. Having background knowledge in the form of a 'good' link classifier is beneficial. This can be seen from the fact that Offline-Online retrieved the largest number of relevant forms in all domains (except for Rental, see discussion below).

This knowledge is especially useful for very sparse domains, where the learning process can be prohibitively expensive due to the low harvest rates. There are instances, however, where the prior knowledge limits the crawler to visiting only a subset of the productive links. If the set of patterns in the initial link classifier is too narrow, it will prevent the crawler from visiting other relevant pages reachable through paths that are not represented in the link classifier. Consider, for example, the Rental domain, where Online outperforms Offline-Online. This behavior may sound counter-intuitive, since both configurations apply online learning and Offline-Online starts with an advantage. The initial link classifier used by Offline-Online was biased, and the adaptive process was slow at correcting this bias.

A closer examination of the features used by Offline-Online shows that, over time, they converge to the same set of features as Online. Online, in contrast, started with no bias and was able to outperform Offline-Online in a window of 100,000 pages.

[Figure 4: Relative performance of Offline-Online over Baseline. The domains are ordered with respect to their densities.]

The presence of bias in the link classifier also explains the poor performance of Offline in Rental, Book and Airfare. For these domains, Offline-Online is able to eliminate the initial bias. ACHE automatically adapts and learns new patterns, leading to a substantial increase in the number of relevant forms retrieved. In the Book domain, for instance, the initial link classifier was constructed using manually gathered forms from online bookstores. Examining the forms obtained by Offline-Online, we observed that forms for online bookstores are only a subset of the relevant forms in this domain. A larger percentage of relevant forms actually appear in library sites. ACHE successfully learned patterns to these sites (see Table 2).

Further evidence of the effectiveness of the adaptive learning strategy is the fact that Online outperforms Offline for four domains: Airfare, Auto, Book, and Rental. For the latter two, Online retrieved 275% and 190% (resp.) more forms than Offline. This indicates that a completely automated approach to learning is effective and able to outperform a manually configured crawler.

[Figure 5: Number of forms retrieved over time. Panels: (a) Auto, (b) Book, (c) Movie.]

The Link Classifier and Delayed Benefit. Figure 4 shows the relative performance between the Offline-Online configuration of ACHE and Baseline, with respect to both relevant forms and searchable forms. Here, the domains are ordered (on the x axis) by increasing order of density.

Note that for the sparser domains, the performance difference between ACHE and Baseline is larger. Also note that the gains from delayed benefit are bigger when the performance is measured with respect to relevant forms. For example, in the Book domain, Offline-Online retrieves almost 9 times more relevant forms than Baseline.

The performance difference is much smaller for searchable forms: Offline-Online retrieves only 10% more searchable forms than Baseline. This can be explained by the fact that searchable forms are much more prevalent than relevant forms within a focus topic. The numbers in Figure 4 underline the importance of taking delayed benefit into account while searching for sparse concepts.

Delayed benefit also plays an important role in the effectiveness of the adaptive learning component of ACHE.
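One way to picture delayed benefit is a crawl frontier that keeps links whose payoff is several steps away instead of discarding them. The following is a minimal sketch, assuming one priority queue per link-classifier level and a rule that always serves the lowest non-empty level first. That scheduling rule, the class name `LevelFrontier`, and the example URLs and scores are assumptions for illustration; the text states only that the link classifiers were configured with three levels.

```python
import heapq
from typing import List, Optional, Tuple


class LevelFrontier:
    """Crawl frontier with one priority queue per link-classifier level."""

    def __init__(self, levels: int = 3) -> None:
        self.levels = levels
        # Each heap entry is (-probability, insertion order, url).
        self._queues: List[List[Tuple[float, int, str]]] = [
            [] for _ in range(levels)
        ]
        self._order = 0

    def push(self, url: str, level: int, probability: float) -> None:
        """Queue a link predicted to be `level` steps from a relevant form."""
        if not 1 <= level <= self.levels:
            raise ValueError(f"level must be in 1..{self.levels}")
        heapq.heappush(
            self._queues[level - 1], (-probability, self._order, url)
        )
        self._order += 1

    def pop(self) -> Optional[str]:
        """Return the best link from the lowest non-empty level, or None."""
        for queue in self._queues:
            if queue:
                return heapq.heappop(queue)[2]
        return None


frontier = LevelFrontier(levels=3)
# Hypothetical URLs and scores, for illustration only.
frontier.push("https://bookstore.example/search", level=1, probability=0.9)
frontier.push("https://library.example/", level=3, probability=0.6)
frontier.push("https://bookstore.example/advanced", level=1, probability=0.7)

order = []
while (url := frontier.pop()) is not None:
    order.append(url)
print(order)
```

In this toy run the level-3 library link is visited last but is still visited. A crawler that followed only level-1 links would never reach it, which is the situation described for the Book domain.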

The use of the link classifier forces ACHE to explore paths with previously unknown patterns. This exploratory behavior is key to adaptation. For example, in the Book domain (see Table 2), since the initial link classifier has a bias towards online bookstores, if ACHE only followed links predicted to yield immediate benefit, it would not be able to reach the library sites. Note, however, that the exploratory behavior can potentially lead the crawler to lose its focus. But as our experimental results show, ACHE is able to obtain a good balance, being able to adapt to new patterns while maintaining its focus.

Crawler Performance over Time.
