An Adaptive Crawler for Locating Hidden-Web Entry Points (2007), page 5
To give insight about the behavior of the different crawler configurations, it is useful to analyze how their harvest rates change over time. Here, we only show these results for Book, Auto and Movie in Figure 5. Similar results were obtained for the other domains. Note that the number of relevant forms retrieved by Online and Baseline coincide initially, and after the crawler starts the learning process, the performance of Online improves substantially.
A similar trend is observed for Offline and Offline-Online: the number of relevant forms retrieved by Offline-Online increases after its policy is first updated.

WWW 2007 / Track: Search, Session: Crawlers

Another interesting observation is that the rate of increase in the number of relevant forms retrieved is higher for the configurations that use online learning.

The increase in the number of forms retrieved over time is a good indication of the effectiveness of online learning for a particular domain.
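The harvest-rate curves discussed above can be reproduced from a crawl trace with a short script. The following is a minimal sketch; the trace representation and function name are illustrative assumptions, not taken from the paper:

```python
# Sketch: build a cumulative harvest-rate curve from a crawl trace.
# The trace format (one boolean per crawled page, True when the page
# yielded a relevant form) is an illustrative assumption.

def harvest_curve(events, checkpoint=1000):
    """Return (pages_crawled, cumulative_relevant_forms) pairs,
    sampled every `checkpoint` pages."""
    curve = []
    relevant = 0
    for pages, found_form in enumerate(events, start=1):
        if found_form:
            relevant += 1
        if pages % checkpoint == 0:
            curve.append((pages, relevant))
    return curve

# Toy run: a crawler that finds a relevant form on every 50th page.
events = [i % 50 == 0 for i in range(1, 5001)]
print(harvest_curve(events, checkpoint=1000))
# → [(1000, 20), (2000, 40), (3000, 60), (4000, 80), (5000, 100)]
```

Plotting such curves for each configuration (Baseline, Offline, Online, Offline-Online) yields graphs of the kind shown in Figure 5.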
For Movie (Figure 5(c)), after crawling 66,000 pages, Baseline retrieves only 93 relevant forms. After that, the number of forms remains (almost) constant (a single additional form is found in the last 30,000 pages). Because so few relevant forms are found, Online is not able to learn due to insufficient evidence for the link patterns. Note that the lines for Baseline and Online coincide for the Movie domain.

In the Book domain, which is also sparse but less so than Movie, Online was able to learn useful patterns and substantially improve the harvest rate. As shown in Figure 5(b), the first learning iteration happened after 50,000 pages had been crawled, much later than for the other, denser domains.
For example, for Auto the first iteration occurs at 17,000 pages (Figure 5(a)). The Auto domain provides a good example of the adaptive behavior of ACHE in denser domains.

These results show that, even starting with no information about patterns that lead to relevant forms, these patterns can be learned dynamically and crawler performance can be improved in an automatic fashion. However, the sparser the domain is, the harder it is for the crawler to learn. For Online to be effective in a very sparse domain, a crawler that is more focused than Baseline is needed initially.

4.3 Feature Selection

[Figure 6: feature-selection terms per iteration; panels (a) URL, (b) Anchor, (c) Around]

The performance improvement obtained by the adaptive crawler configurations provides evidence of the effectiveness of the automatic feature selection described in Section 3.3. As an example of its operation, consider Figure 6, which shows the terms selected by AFS in 4 learning iterations for the Auto domain using the Online configuration.
Note that AFS is able to identify good features from scratch and without any manual intervention. For both the Anchor and Around feature sets, already in the first iteration, relevant terms are discovered which remain in subsequent iterations (e.g., car, auto), although their frequency changes over time. For instance, the term "auto" has the highest frequency in the first iteration, whereas "car" has the highest frequency after the second iteration.

Unlike Anchor and Around, the URL feature set is not so well-behaved.
Because URLs contain uncommon terms and more variability, the patterns take longer to converge. As Figure 6 shows, after the first iteration AFS selects terms that disappear in subsequent iterations (e.g., "index" and "rebuilt"). In addition, the frequencies of terms in the URL are much lower than in the other feature spaces.

A final observation about the automatic feature selection is that by analyzing how the features evolve over time, and change at each learning iteration, insights can be obtained about the adaptive behavior of ACHE.
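The per-feature-set term selection described above can be approximated by a frequency cutoff over the three link contexts. The sketch below is a simplification under that assumption; the actual AFS in the paper may apply stemming and additional selection criteria not shown here:

```python
from collections import Counter

# Simplified sketch of frequency-based feature selection over the
# three feature sets (anchor, url, around). The input format is an
# illustrative assumption: one dict of token lists per link that is
# known to lead to a relevant form.

def select_features(links, k=5):
    """Return the k most frequent terms in each feature space."""
    selected = {}
    for space in ("anchor", "url", "around"):
        counts = Counter(tok for link in links for tok in link[space])
        selected[space] = [term for term, _ in counts.most_common(k)]
    return selected

links = [
    {"anchor": ["used", "car", "search"], "url": ["autos", "search"],
     "around": ["find", "car", "dealer"]},
    {"anchor": ["car", "finder"], "url": ["car", "index"],
     "around": ["search", "car", "inventori"]},
]
print(select_features(links, k=2))
```

Re-running such a selection after each learning iteration, as new relevant forms accumulate, produces the kind of term drift visible in Figure 6 and Table 2.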
For example, as Table 2 illustrates, for the Offline-Online in the Book domain, the features selected for the initial link classifier are clearly related to online bookstores (e.g., book, search and bookstor). As new relevant forms are encountered, new terms are introduced that are related to library sites (e.g., ipac, librari, book, search, and catalog).

Figure 6: Features automatically extracted in different iterations of adaptive learning for Online in the Auto domain.

5. RELATED WORK

There is a rich literature in the area of focused crawlers (see e.g., [1, 3, 8, 9, 11, 22, 24, 19]).
Closely related to our work are strategies that apply online learning and that take delayed benefit into account. We discuss these below.

Delayed Benefit. Rennie and McCallum [22] frame the problem of creating efficient crawlers as a reinforcement learning task. They describe an algorithm for learning a function that maps hyperlinks to future discounted reward: the expected number of relevant pages that can be found as a result of following that link. They show that by taking delayed rewards into account, their RL Spider is able to efficiently locate sparse concepts on a predetermined universe of URLs. There are several important differences between the RL Spider and ACHE. First and foremost, the RL Spider does not perform a broad search over the Web.
It requires as input the URLs of all sites to be visited and performs a focused search only within these pre-determined sites. Second, it is not adaptive: the classifier maintains a fixed policy during the crawl. Finally, although their classifier also learns to prioritize links and it considers links that have delayed benefit, the learning function used in ACHE is different: the link classifier estimates the distance between a link and a relevant form (see Section 2.1).

The importance of considering delayed benefit in a focused crawl was also discussed by Diligenti et al.
[11].

(ipac is a system used by some library sites to search their catalogs.)

Iteration              URL                                  Anchor                                  Around
0 (initial features)   book,search,addal,natur,hist         book,search,addal,bookstor,link         book,search,onlin,new,bookstor
1                      search,index,book                    search,book,advanc,librari,engin        book,search,titl,onlin,author
2                      search,adv,lib,index,ipac            search,book,advanc,librari,catalog      book,search,librari,titl,author
3                      search,lib,ipac,profil,catalog       search,librari,catalog,advanc,book      librari,search,book,catalog,titl
4                      search,lib,ipac,profil,catalog       librari,search,catalog,advanc,book      librari,search,catalog,book,titl
5                      search,lib,ipac,profil,catalog       librari,search,catalog,advanc,book      librari,search,book,catalog,public

Table 2: Features selected during the execution of the Offline-Online in the Book domain.

Their Context Focused Crawler (CFC) learns a hierarchy of concepts related to a search topic.
The intuition behind their approach is that topically relevant pages can be found by using a context graph which encodes topics that are directly or indirectly related to the target topic. By performing a backward crawl from a sample set of relevant pages, they create classifiers that identify these related topics and estimate, for a given page r, the distance between r and a topic-relevant page. Unlike ACHE, the focus of the CFC is based solely on the contents of pages: it does not prioritize links. If a page is considered relevant, all links in that page will be followed. Like the RL Spider, the CFC uses a fixed focus strategy.

The FFC [3] combines ideas from these two approaches. It employs two classifiers: one that uses page contents to focus the search on a given topic; and another that identifies promising links within the focus topic.
An experimental evaluation over three distinct domains showed that combining the page contents and hyperlink structure leads to substantial improvement in crawler efficiency. The differences between FFC and ACHE are discussed in Section 3.

Unlike these approaches, ACHE adaptively updates its focus strategy as it learns from new experience. There is an important benefit derived from combining delayed benefit and online learning: following links that have delayed benefit forces the crawler to explore new paths and enables it to learn new patterns. This is in contrast to strategies based on immediate benefit, which exploit actions the crawler has already learned will yield high reward.
This exploratory behavior makes our adaptive learning strategy robust and enables it to correct biases created in the learning process. This was observed in several of the domains we considered in our experimental evaluation (Section 4). In the Book domain, for instance, the crawler with a fixed policy (Offline) was trapped in a subset of promising links related to online bookstores, whereas ACHE was able to eliminate this bias in the first learning iteration and learn new patterns that allowed it to also obtain relevant forms from online library sites.

Online Learning Policies.
Chakrabarti et al. [8] proposed an online learning strategy that, similar to ACHE, uses two classifiers to focus the search: a baseline page classifier that learns to classify pages as belonging to topics in a taxonomy [9]; and the apprentice, a classifier that learns to identify the most promising links in a topic-relevant page. Their motivation to use the apprentice comes from the observation that, even in domains that are not very narrow, the number of links that are irrelevant to the topic can be very high. Thus, following all the links in a topic-relevant page can be wasteful. The baseline classifier captures the user's specification of the topic and functions as a critic of the apprentice, by giving feedback about its choices.
The apprentice, using this feedback, learns the features of good links and is responsible for prioritizing the links in the crawling frontier.

Although ACHE also attempts to estimate the benefit of following a particular link based on the crawler's experience, there is an important difference. Because the apprentice only follows links that give immediate benefit, biases that are introduced by the online learning process are reinforced as the crawl progresses: the crawler will repeatedly exploit actions it has already learned will yield high reward.
In contrast, as discussed above, by considering links that may lead to delayed benefit, ACHE has a more exploratory behavior and will visit unknown states and actions.

It is worth noting that the goal of our link classifier is complementary to that of the apprentice: it aims to learn which links lead to pages that contain relevant forms, whereas the goal of the apprentice is to avoid off-topic pages. In addition, we are dealing with concepts that are much sparser than the ones considered by Chakrabarti et al. For example, the density of the domains considered in [8] varied from 31% to 91%, whereas for the concepts we considered, density values range between 0.094% and 1.857%. Nonetheless, our approach is likely to benefit from such an apprentice, since it would reduce the number of off-topic pages retrieved and improve the overall crawling efficiency.
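The contrast between immediate-benefit and delayed-benefit link scoring can be illustrated with a toy scoring rule. This is purely illustrative: the discount factor, the function names, and the numbers are assumptions for the sketch, not ACHE's actual classifier:

```python
# Toy contrast between immediate- and delayed-benefit link scoring.
# All names and numbers here are illustrative assumptions, not the
# actual output of ACHE's link classifier.

GAMMA = 0.5  # assumed discount per step of estimated distance

def immediate_score(p_relevant_next_page):
    # Apprentice-style scoring: a link earns reward only if the
    # page it points to is itself relevant.
    return p_relevant_next_page

def delayed_score(distance_estimate):
    # Delayed-benefit intuition: a link estimated to be d steps away
    # from a page with a relevant form still earns a discounted,
    # positive score, so the crawler keeps exploring that path.
    return GAMMA ** distance_estimate

# A link whose target page has no form but likely leads to one in
# two more steps: worthless under immediate benefit, still worth
# exploring under delayed benefit.
print(immediate_score(0.0))  # 0.0
print(delayed_score(2))      # 0.25
```

Under immediate-benefit scoring the frontier is dominated by links the crawler already knows pay off, which is exactly the self-reinforcing bias discussed above; the discounted distance-based score keeps exploratory paths alive.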