Главная » Все файлы » Просмотр файлов из архивов » PDF-файлы » An Adaptive Crawler for Locating Hidden-Web Entry Points (2007)

An Adaptive Crawler for Locating Hidden-Web Entry Points (2007) (тематика web-краулеров), страница 5

PDF-файл An Adaptive Crawler for Locating Hidden-Web Entry Points (2007) (тематика web-краулеров), страница 5 Английский язык (62809): Реферат - 9 семестр (1 семестр магистратуры)An Adaptive Crawler for Locating Hidden-Web Entry Points (2007) (тематика web-краулеров) - PDF, страница 5 (62809) - СтудИзба2020-08-17СтудИзба

Описание файла

Файл "An Adaptive Crawler for Locating Hidden-Web Entry Points (2007)" внутри архива находится в папке "тематика web-краулеров". PDF-файл из архива "тематика web-краулеров", который расположен в категории "". Всё это находится в предмете "английский язык" из 9 семестр (1 семестр магистратуры), которые можно найти в файловом архиве МГУ им. Ломоносова. Не смотря на прямую связь этого архива с МГУ им. Ломоносова, его также можно найти и в других разделах. .

Просмотр PDF-файла онлайн

Текст 5 страницы из PDF

To give insight aboutthe behavior of the different crawler configurations, it isuseful to analyze how their harvest rates change over time.Here, we only show these results for Book, Auto and Moviein Figure 5. Similar results were obtained for the other domains.Note that the number of relevant forms retrieved by Online and Baseline coincide initially, and after the crawlerstarts the learning process, the performance of Online improves substantially.

A similar trend is observed for Offlineand Offline-Online—the number of relevant forms retrievedby Offline-Online increases after its policy is first updated.447Feature Selection-URL, AroundWWW 2007 / Track: SearchSession: CrawlersAnother interesting observation is that the rate of increasein the number of relevant forms retrieved is higher for theconfigurations that use online-learning.The increase in the number of forms retrieved over time isa good indication of the effectiveness of online learning for aparticular domain.

For Movie (Figure 5(c)), after crawling66,000 pages, Baseline retrieves only 93 relevant forms. After that, the number of forms remain (almost) constant (asingle additional form is found in the last 30,000 pages). Because so few relevant forms are found, Online is not able tolearn due to insufficient evidence for the link patterns.Notethat the lines for Baseline and Online coincide for the Moviedomain.In the Book domain, which is also sparse but less so thanMovie, Online was able to learn useful patterns and substantially improve the harvest rate. As shown in Figure 5(b), thefirst learning iteration happened after 50,000 pages had beencrawled—much later than for the other denser domains.

Forexample, for Auto the first iteration occurs at 17,000 pages(Figure 5(a)). The Auto domain provides a good exampleof the adaptive behavior of ACHE in denser domains.These results show that, even starting with no informationabout patterns that lead to relevant forms, these patternscan be learned dynamically and crawler performance canbe improved in an automatic fashion. However, the sparserthe domain is, the harder it is for the crawler to learn. ForOnline to be effective in a very sparse domain, a crawlerthat is more focused than Baseline is needed initially.4.3Feature Selection Anchor(a) URLFeature Selection-URL,Around(b) AnchorFeature Selection(c) AroundThe performance improvement obtained by the adaptivecrawler configurations provides evidence of the effectivenessof the automatic feature selection described in Section 3.3.As an example of its operation, consider Figure 6, whichshows the terms selected by AFS in 4 learning iterations forthe Auto domain using the Online configuration.

Note thatAFS is able to identify good features from scratch and without any manual intervention. For both Anchor and Aroundfeature sets, already in the first iteration, relevant terms arediscovered which remain in subsequent iterations (e.g., car,auto), although their frequency changes over time. For instance, the term “auto” has the highest frequency in thefirst iteration, whereas “car” has the highest frequency afterthe second iteration.Unlike Anchor and Around, the URL feature set is notso well-behaved.

Because URLs contain uncommon termsand more variability, the patterns take longer to converge.As Figure 6 shows, after the first iteration the AFS selectsterms that disappear in subsequent iterations (e.g., “index”and “rebuilt”). In addition, the frequencies of terms in theURL are much lower than in the other feature spaces.A final observation about the automatic feature selectionis that by analyzing how the features evolve over time, andchange at each learning iteration, insights can be obtainedabout the adaptive behavior of ACHE .

For example, as Table 2 illustrates, for the Offline-Online in the Book domain,the features selected for the initial link classifier are clearlyrelated to online bookstores (e.g., book, search and bookstor). As new relevant forms are encountered, new termsare introduced that are related to library sites (e.g., ipac,6librari, book, search, and catalog).Figure 6: Features automatically extracted in different iterations of adaptive learning for Online in theAuto domain.5.RELATED WORKThere is a rich literature in the area of focused crawlers(see e.g., [1, 3, 8, 9, 11, 22, 24, 19]).

Closely related to ourwork are strategies that apply online-learning and that takedelayed benefit into account. We discuss these below.Delayed Benefit. Rennie and McCallum [22] frame theproblem of creating efficient crawlers as a reinforcement learning task. They describe an algorithm for learning a functionthat maps hyperlinks to future discounted reward: the expected number of relevant pages that can be found as a result of following that link. They show that by taking delayedrewards into account, their RL Spider is able to efficientlylocate sparse concepts on a predetermined universe of URLs.There are several important differences between the RL Spider and ACHE . First and foremost, the RL Spider does notperform a broad search over the Web.

It requires as inputthe URLs of all sites to be visited and performs a focusedsearch only within these pre-determined sites. Second, it isnot adaptive—the classifier maintains a fixed policy duringthe crawl. Finally, although their classifier also learns toprioritize links and it considers links that have delayed benefit, the learning function used in ACHE is different: thelink classifier estimates the distance between a link and arelevant form (see Section 2.1).The importance of considering delayed benefit in a focused crawl was also discussed by Diligenti et al.

[11]. Their6ipac is a system used by some library sites to search theircatalogs.448WWW 2007 / Track: SearchIteration0 (initial features)12345URLbook,search,addal,natur,histsearch,index,booksearch,adv,lib,index,ipacsearch,lib,ipac,profil,catalogsearch,lib,ipac,profil,catalogsearch,lib,ipac,profil,catalogSession: CrawlersSelected FeaturesAnchorbook,search,addal,bookstor,linksearch,book,advanc,librari,enginsearch,book,advanc,librari,catalogsearch,librari,catalog,advanc,booklibrari,search,catalog,advanc,booklibrari,search,catalog,advanc,bookAroundbook,search,onlin,new,bookstorbook,search,titl,onlin,authorbook,search,librari,titl,authorlibrari,search,book,catalog,titllibrari,search,catalog,book,titllibrari,search,book,catalog,publicTable 2: Features selected during the execution of the Offline-Online in Book domain.Context Focused Crawler (CFC) learns a hierarchy of concepts related to a search topic.

The intuition behind theirapproach is that topically relevant pages can be found byusing a context graph which encodes topics that are directlyor indirectly related to the target topic. By performing abackward crawl from a sample set of relevant pages, theycreate classifiers that identify these related topics and estimate, for a given page r, the distance between r and atopic-relevant page. Unlike ACHE , the focus of the CFC issolely based on the contents of pages—it does not prioritizelinks. If a page is considered relevant, all links in that pagewill be followed. Like the RL Spider, the CFC uses a fixedfocus strategy.The FFC [3] combines ideas from these two approaches.It employs two classifiers: one that uses page contents thatfocuses the search on a given topic; and another that identifies promising links within the focus topic.

An experimentalevaluation over three distinct domains showed that combining the page contents and hyperlink structure leads to substantial improvement in crawler efficiency. The differencesbetween FFC and ACHE are discussed in Section 3.Unlike these approaches, ACHE adaptively updates itsfocus strategy as it learns from new experience. There is animportant benefit derived from combining delayed benefitand online-learning: following links that have delayed benefit forces the crawler to explore new paths and enables it tolearn new patterns. This is in contrast to strategies based onimmediate benefit that exploit actions it has already learnedwill yield high reward.

This exploratory behavior makes ouradaptive learning strategy robust and enables it to correctbiases created in the learning process. This was observedin several of the domains we considered in our experimentalevaluation (Section 4). In the Book domain, for instance, thecrawler with a fixed policy (Offline) was trapped in a subset of promising links related to online bookstores, whereasACHE was able to eliminate this bias in the first learning iteration and learn new patterns that allowed it to also obtainrelevant forms from online library sites.Online Learning Policies.

Chakrabarti et al. [8] proposedan online learning strategy that, similar to ACHE uses twoclassifiers to focus the search: a baseline page classifier thatlearns to classify pages as belonging to topics in a taxonomy [9]; and the apprentice, a classifier that learns to identify the most promising links in a topic-relevant page. Theirmotivation to use the apprentice comes from the observationthat, even in domains that are not very narrow, the numberof links that are irrelevant to the topic can be very high.Thus, following all the links in a topic-relevant page can bewasteful. The baseline classifier captures the user’s specification of the topic and functions as a critic of the apprentice,by giving feedback about its choices.

The apprentice, usingthis feedback, learns the features of good links and is re-sponsible for prioritizing the links in the crawling frontier.Although ACHE also attempts to estimate the benefit offollowing a particular link based on the crawler experience,there is an important difference. Because the apprenticeonly follows links that give immediate benefit, biases thatare introduced by the online learning process are reinforcedas the crawl progresses—the crawler will repeatedly exploitactions it has already learned will yield high reward.

In contrast, as discussed above, by considering links that may leadto delayed benefit, ACHE has a more exploratory behaviorand will visit unknown states and actions.It is worthy of note that the goal of our link classifier iscomplementary to that of the apprentice—it aims to learnwhich links lead to pages that contain relevant forms, whereasthe goal of the apprentice is to avoid off-topic pages. In addition, we are dealing with concepts that are much sparserthan the ones considered by Chakrabarti et al.. For example, the density of the domains considered in [8] varied from31% to 91%, whereas for concepts we considered densityvalues range between 0.094% and 1.857%. Nonetheless, ourapproach is likely to benefit from such an apprentice, sinceit would reduce the number of off-topic pages retrieved andimprove the overall crawling efficiency.

Свежие статьи
Популярно сейчас
Как Вы думаете, сколько людей до Вас делали точно такое же задание? 99% студентов выполняют точно такие же задания, как и их предшественники год назад. Найдите нужный учебный материал на СтудИзбе!
Ответы на популярные вопросы
Да! Наши авторы собирают и выкладывают те работы, которые сдаются в Вашем учебном заведении ежегодно и уже проверены преподавателями.
Да! У нас любой человек может выложить любую учебную работу и зарабатывать на её продажах! Но каждый учебный материал публикуется только после тщательной проверки администрацией.
Вернём деньги! А если быть более точными, то автору даётся немного времени на исправление, а если не исправит или выйдет время, то вернём деньги в полном объёме!
Да! На равне с готовыми студенческими работами у нас продаются услуги. Цены на услуги видны сразу, то есть Вам нужно только указать параметры и сразу можно оплачивать.
Отзывы студентов
Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.
Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.
Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.
Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.
Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.
Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.
Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.
Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.
Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.
Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.
Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.
Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.
Популярные преподаватели
Добавляйте материалы
и зарабатывайте!
Продажи идут автоматически
5288
Авторов
на СтудИзбе
417
Средний доход
с одного платного файла
Обучение Подробнее