An Adaptive Crawler for Locating Hidden-Web Entry Points (2007), page 3


ACHE keeps a repository of successful paths: when it identifies a relevant form, it adds the path it followed to that form to the repository. Its operation is described in Algorithm 1. The adaptive link learner is invoked periodically, when the learning threshold is reached (line 1), for example after the crawler visits a predetermined number of pages or retrieves a predefined number of relevant forms. Note that if the threshold is too low, the crawler may not be able to retrieve enough new samples to learn effectively. On the other hand, if the value is too high, the learning rate will be slow. In our experiments, learning iterations are triggered after 100 new relevant forms are found.

When a learning iteration starts, features are automatically extracted from the new paths (Section 3.3). Using these features and the set of path instances, the adaptive link learner generates a new link classifier.³ As the last step, the link learner updates the frontier manager with the new link classifier.
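The trigger logic described above can be sketched in Python as follows. This is a minimal illustration, not ACHE's implementation: the class name `AdaptiveLinkLearner`, the three callbacks, and the representation of a path as a list of URLs are assumptions made for the example. Only the default threshold of 100 new relevant forms comes from the text.

```python
from typing import Callable, List, Sequence

Path = List[str]  # hypothetical representation: URLs followed to reach a relevant form


class AdaptiveLinkLearner:
    """Stores successful paths and retrains once enough new ones have arrived."""

    def __init__(
        self,
        select_features: Callable[[Sequence[Path]], list],
        train_classifier: Callable[[list, Sequence[Path]], object],
        update_frontier: Callable[[object], None],
        learning_threshold: int = 100,  # value reported for the experiments
    ) -> None:
        self.select_features = select_features
        self.train_classifier = train_classifier
        self.update_frontier = update_frontier
        self.learning_threshold = learning_threshold
        self.repository: List[Path] = []  # every successful path seen so far
        self.new_paths = 0                # paths added since the last iteration
        self.iterations = 0

    def add_successful_path(self, path: Path) -> bool:
        """Record a path to a relevant form; return True if learning ran."""
        self.repository.append(path)
        self.new_paths += 1
        if self.new_paths < self.learning_threshold:
            return False
        # Features come from the new paths; training on the whole repository
        # is an assumption about what "the set of path instances" means.
        new = self.repository[-self.new_paths:]
        features = self.select_features(new)
        classifier = self.train_classifier(features, self.repository)
        self.update_frontier(classifier)  # the frontier re-ranks its links
        self.new_paths = 0
        self.iterations += 1
        return True
```

Counting new relevant forms is only one possible trigger; a page-count threshold could be added in the same way by exposing a second counter.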

The frontier manager then updates the Q values of the links using the new link classifier, i.e., it re-ranks all links in the frontier using the new policy.

3.1 The ACHE Architecture

Figure 1 shows the high-level architecture of ACHE. The components that we added to enable the crawler to learn from its experience are highlighted (in blue). The frontier manager (Section 2.2) acts as both the BGE and PG and balances the trade-off between exploration and exploitation. It does so by using a policy for selecting unvisited links from the crawling frontier which considers links with both immediate and delayed benefit.

The Q function (Equation 1) provides the exploitation component (BGE). It ensures the crawler exploits the acquired knowledge to select actions that yield high reward, i.e., links that lead to relevant forms. By also selecting links estimated to have delayed reward, the frontier manager provides an exploratory component (PG), which enables the crawler to explore actions with previously unknown patterns.
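One way to picture a frontier that serves both immediate and delayed benefit is sketched below. The actual selection policy is defined in Section 2.2, which is not reproduced here, so everything in this block is an assumption made for illustration: the class `Frontier`, the use of one priority queue per estimated distance ("level") to a relevant form, and the round-robin rule that alternates between levels.

```python
import heapq
import itertools
from typing import Callable, Dict, List, Optional, Tuple


class Frontier:
    """Keeps one priority queue per estimated distance (level) to a relevant form."""

    def __init__(self, levels: int = 3) -> None:
        # Level 1 holds links with immediate benefit; higher levels hold
        # links whose benefit is delayed by one or more extra steps.
        self.queues: Dict[int, List[Tuple[float, int, str]]] = {
            level: [] for level in range(1, levels + 1)
        }
        self._tie = itertools.count()  # insertion counter, breaks score ties
        self._next_level = 1

    def add(self, url: str, level: int, score: float) -> None:
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self.queues[level], (-score, next(self._tie), url))

    def next_link(self) -> Optional[str]:
        """Return the best link of the next non-empty level, cycling over levels."""
        n = len(self.queues)
        for offset in range(n):
            level = (self._next_level - 1 + offset) % n + 1
            if self.queues[level]:
                self._next_level = level % n + 1
                return heapq.heappop(self.queues[level])[2]
        return None

    def rerank(self, classify: Callable[[str], Tuple[int, float]]) -> None:
        """Re-score every queued link with a new classifier: url -> (level, score)."""
        urls = [item[2] for queue in self.queues.values() for item in queue]
        for queue in self.queues.values():
            queue.clear()
        for url in urls:
            level, score = classify(url)
            self.add(url, level, score)
```

Cycling over levels keeps delayed-benefit links from being starved by a steady supply of high-scoring immediate links; a weighted or probabilistic choice between levels would be an equally plausible design.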

The exploratory behavior of the frontier manager makes ACHE robust and enables it to correct biases that may be introduced in its policy. We discuss this issue in more detail in Section 4.

The form filtering component is the critic. It consists of two classifiers: the searchable form classifier (SFC)² and the domain-specific form classifier (DSFC).

Forms are processed by these classifiers in a sequence: each retrieved form is first classified by the SFC as searchable or non-searchable; the DSFC then examines the searchable forms and indicates whether they belong to the target database domain (see Section 3.4 for details).

Algorithm 1 (continued)
  [...] {Collect paths of a given length to pages that contain relevant forms.}
  features = FeatureSelector(paths) {Select the features from the neighborhood of links in the paths.}
  linkClassifier = createClassifier(features, paths) {Create new link classifier.}
  updateFrontier(linkClassifier) {Re-rank links in the frontier using the new link classifier.}
end if

3.3 Automating the Feature Selection Process

The effectiveness of the link classifier is highly dependent on the ability to identify discriminating features of links.

In ACHE, these features are automatically extracted, as described in Algorithm 2. The Automatic Feature Selection (AFS) algorithm extracts features present in the anchor, URL, and text around links that belong to paths which lead to relevant forms.

² The SFC is also used in the FFC.
³ The length of the paths considered depends on the number of levels used in the link classifier.

WWW 2007 / Track: Search / Session: Crawlers

Algorithm 2 Automatic Feature Selection
1: Input: set of links at distance d from a relevant form
2: Output: features selected in the three feature spaces (anchor, URL and around)

[...] enables the crawler to adaptively update its focus strategy, as it identifies new paths to relevant forms during a crawl. Therefore, the overall performance of the crawler agent is highly dependent on the accuracy of the form-filtering process.

If the classifiers are inaccurate, crawler efficiency can be greatly reduced as it drifts away from its objective through unproductive paths.

The form-filtering process needs to identify, among the set of forms retrieved by the crawler, forms that belong to the target database domain. Even a focused crawler retrieves a highly heterogeneous set of forms. A focus topic (or concept) may encompass pages that contain many different database domains.

For example, while crawling to find airfare search interfaces the FFC also retrieves a large number of forms for rental car and hotel reservation, since these are often co-located with airfare search interfaces in travel sites. The retrieved forms also include non-searchable forms that do not represent database queries such as forms for login, mailing list subscriptions, and Web-based email forms.

ACHE uses HIFI, a hierarchical classifier ensemble proposed in [4], to filter out irrelevant forms. Instead of using a single, complex classifier, HIFI uses two simpler classifiers that learn patterns of different subsets of the form feature space. The Generic Form Classifier (GFC) uses structural patterns which determine whether a form is searchable.

Empirically, we have observed that these structural characteristics of a form are a good indicator as to whether the form is searchable or not [3]. To identify searchable forms that belong to a given domain, HIFI uses a more specialized classifier, the Domain-Specific Form Classifier (DSFC). The DSFC uses the textual content of a form to determine its domain. Intuitively, the form content is often a good indicator of the database domain: it contains metadata and data that pertain to the database.

By partitioning the feature space, not only can simpler classifiers be constructed that are more accurate and robust, but this also enables the use of learning techniques that are more effective for each feature subset. Whereas decision trees [20] gave the lowest error rates for determining whether a form is searchable based on structural patterns, SVMs [20] proved to be the most effective technique to identify forms that belong to the given database domain based on their textual content.

The details of these classifiers are out of the scope of this paper.
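The sequential use of the two classifiers can be illustrated with a short Python sketch. It shows only the control flow: the `Form` record, the function `filter_forms`, and the two rule-based stand-ins `toy_gfc` and `toy_airfare_dsfc` are hypothetical and replace the trained decision tree and SVM, which are described in [4] and not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Form:
    # Hypothetical, simplified view of an HTML form.
    text: str                  # visible text inside the form (second stage)
    n_text_inputs: int = 0     # structural features (first stage)
    n_selects: int = 0
    has_password: bool = False


def filter_forms(
    forms: Iterable[Form],
    is_searchable: Callable[[Form], bool],  # plays the role of the GFC
    in_domain: Callable[[Form], bool],      # plays the role of the DSFC
) -> List[Form]:
    """Apply the two classifiers in sequence and keep forms accepted by both."""
    relevant = []
    for form in forms:
        if not is_searchable(form):
            continue  # non-searchable forms never reach the second stage
        if in_domain(form):
            relevant.append(form)
    return relevant


# Toy stand-ins, NOT the trained classifiers from the paper.
def toy_gfc(form: Form) -> bool:
    """Structural rule: no password field and at least two query fields."""
    return not form.has_password and (form.n_text_inputs + form.n_selects) >= 2


def toy_airfare_dsfc(form: Form) -> bool:
    """Textual rule: at least two airfare-related words appear in the form."""
    words = set(form.text.lower().split())
    return len(words & {"flight", "airfare", "departure", "airline"}) >= 2
```

Because each stage sees only its own feature subset, either stand-in can be swapped for a trained model without touching the other, which is the practical benefit of partitioning the feature space.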

The classifiers are described in [4], where we show that the combination of the two classifiers leads to very high precision, recall and accuracy. The effectiveness of the form filtering component is confirmed by our experimental evaluation (Section 4): significant improvements in harvest rates are obtained by the adaptive crawling strategies. For the database domains used in this evaluation, the combination of these two classifiers results in accuracy values above 90%.

Algorithm 2 Automatic Feature Selection (continued)
3: for each featureSpace do {The feature spaces are anchor, URL and around.}
4:   termSet = getTermSet(featureSpace, paths) {From the paths, obtain terms in specified feature space.}
5:   termSet = removeStopWords(termSet)
6:   stemmedSet = stem(termSet)
7:   if featureSpace == URL then
8:     topKTerms = getMostFrequentTerms(stemmedSet, k) {Obtain the set of k most frequent terms.}
9:     for each term t ∈ topKTerms do
10:      for each term t′ ∈ stemmedSet that contains the substring t do
11:        addFrequency(stemmedSet, t, t′) {Add frequency of t′ to t in stemmedSet.}
12:      end for
13:    end for
14:  end if
15:  selectedFeatures = getNMostFrequentTerms(termSet) {Obtain a set of the top n terms.}
16: end for

Initially, all terms in anchors are extracted to construct the anchor feature set.

For the around feature set, AFS selects the n terms that occur before and the n terms that occur after the anchor (in textual order). Because the number of extracted terms in these different contexts tends to be large, stop-words are removed (line 5) and the remaining terms are stemmed (line 6). The most frequent terms are then selected to construct the feature set (line 15).

The URL feature space requires special handling. Since there is little structure in a URL, extracting terms from a URL is more challenging. For example, “jobsearch” and “usedcars” are terms that appear in URLs of the Job and Auto domains, respectively.

To deal with this problem, we try to identify meaningful sub-terms using the following strategy. After the terms are stemmed, the k most frequent terms are selected (topKTerms in line 8). Then, if a term in this set appears as a substring of another term in the URL feature set, its frequency is increased by the frequency of that other term (lines 9 to 13). Once this process finishes, the k most frequent terms are selected.

The feature selection process must produce features that are suitable for the learning scheme used by the underlying classifier. For text classification, Zheng et al. [29] show that the Naïve Bayes model obtains better results with a much lower number of features than linear methods such as Support Vector Machines [20]. As our link classifier is built using the Naïve Bayes model, we performed an aggressive feature selection and selected a small number of terms for each feature space. The terms selected are the ones with highest document frequency (DF)⁴.
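The selection procedure for a single feature space can be sketched in Python as below. It is an illustration under stated assumptions, not the authors' code: the stop-word list and the suffix-stripping `stem` are crude placeholders for a real stop list and stemmer, document frequency is taken to mean the number of links in which a term occurs, and the substring credit uses a snapshot of the counts rather than updating them in place.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Placeholder stop list (an assumption; the paper does not give one).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "www", "com", "http"}


def stem(term: str) -> str:
    """Crude suffix stripper used as a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def top_terms(freq: Dict[str, int], n: int) -> List[str]:
    """Return the n most frequent terms; ties are broken alphabetically."""
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def select_features(
    links: Iterable[List[str]], n: int, is_url_space: bool = False, k: int = 10
) -> List[str]:
    """Select the n most frequent stemmed terms of one feature space.

    `links` holds one list of raw terms per link on a successful path.
    """
    freq: Counter = Counter()
    for terms in links:
        stemmed = {stem(t.lower()) for t in terms if t.lower() not in STOP_WORDS}
        freq.update(stemmed)  # a term counts once per link

    if is_url_space:
        # Credit each of the k most frequent terms with the counts of longer
        # terms that contain it, so "job" also benefits from "jobsearch".
        snapshot = dict(freq)
        for t in top_terms(snapshot, k):
            for other, count in snapshot.items():
                if other != t and t in other:
                    freq[t] += count

    return top_terms(freq, n)
```

With the URL handling switched on, fused terms such as "jobsearch" raise the rank of their parts ("job", "search"), which is the effect the substring step is meant to achieve; keeping `n` small matches the aggressive selection the text recommends for a Naïve Bayes classifier.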

