An Adaptive Crawler for Locating Hidden-Web Entry Points (WWW 2007), page 6
Integrating the apprentice in our framework is a direction we plan to pursue in future work.

Aggarwal et al. [1] proposed an online-learning strategy to learn features of pages that satisfy a user-defined predicate. They start the search with a generic crawler; as new pages that satisfy the user-defined predicates are encountered, the crawler gradually constructs its focus policy. Their method for identifying relevant documents combines different predictors for content and link structure, and manual tuning is required to determine the contribution of each predictor to the final result. In addition, similar to Chakrabarti et al., their strategy only learns features that give immediate benefit. Another drawback of this approach is its use of a generic crawler at the beginning of its execution. Because a generic crawler may need to visit a very large number of pages in order to obtain a significant sample, the learning costs may be prohibitive for sparse domains. As a point of comparison, consider the behavior of the online crawler for the Movie domain (Section 4): even using a focused crawler, only 94 relevant forms were retrieved in a 100,000-page crawl, and these were not sufficient to derive useful patterns.
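The weighted combination of separate content and link-structure predictors that Aggarwal et al. describe can be sketched roughly as follows. This is a minimal illustration, not their implementation: the feature representation, the stand-in predictors, and the weights are all hypothetical, and the weights must be tuned by hand, which is precisely the drawback noted above.

```python
def link_score(content_features, link_features,
               w_content=0.7, w_link=0.3):
    """Combine a content predictor and a link-structure predictor
    into one relevance score for an unvisited link.

    The weights are hypothetical and manually tuned per domain.
    """
    p_content = content_predictor(content_features)
    p_link = link_predictor(link_features)
    return w_content * p_content + w_link * p_link


# Stand-in predictors for the sketch; in a real crawler these would
# be trained online as pages satisfying the predicate are found.
def content_predictor(features):
    # e.g., fraction of predicate-related terms around the anchor
    return min(1.0, sum(features) / max(len(features), 1))

def link_predictor(features):
    # e.g., fraction of relevant pages among known in-neighbors
    return min(1.0, sum(features) / max(len(features), 1))
```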
A much larger number of pages would have to be crawled by a general crawler to obtain the same 94 forms.

6. CONCLUSION AND FUTURE WORK

We have presented a new adaptive focused crawling strategy for efficiently locating hidden-Web entry points. This strategy effectively balances the exploitation of acquired knowledge with the exploration of links with previously unknown patterns, making it robust and able to correct biases introduced in the learning process. We have shown, through a detailed experimental evaluation, that substantial increases in harvest rates are obtained as crawlers learn from new experiences. Since crawlers that learn from scratch are able to obtain harvest rates that are comparable to, and sometimes higher than, those of manually configured crawlers, this framework can greatly reduce the effort to configure a crawler. In addition, by using the form classifier, ACHE produces high-quality results that are crucial for a number of information integration tasks.

There are several important directions we intend to pursue in future work. As discussed in Section 5, we would like to integrate the apprentice of [8] into the ACHE framework. To accelerate the learning process and better handle very sparse domains, we will investigate the effectiveness and trade-offs involved in using back-crawling during the learning iterations to increase the number of sample paths. Finally, to further reduce the effort of crawler configuration, we are currently exploring strategies to simplify the creation of the domain-specific form classifiers.
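The exploitation/exploration balance mentioned above can be illustrated generically with an epsilon-greedy crawl frontier: most of the time the crawler follows its best-scoring link, but occasionally it picks a random one so that links with unfamiliar patterns still get visited. This is only a sketch of the general idea, not the paper's actual policy, and the class and parameter names are invented for the example.

```python
import heapq
import random

class Frontier:
    """Toy epsilon-greedy crawl frontier.

    With probability 1 - eps, pop the highest-scoring link
    (exploitation); with probability eps, pop a uniformly random
    link (exploration of previously unknown patterns).
    """
    def __init__(self, eps=0.1, seed=None):
        self.eps = eps
        self.heap = []        # entries: (-score, counter, url)
        self.rng = random.Random(seed)
        self.counter = 0      # tie-breaker so urls are never compared

    def push(self, url, score):
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1

    def pop(self):
        if self.rng.random() < self.eps:
            # exploration: remove a uniformly random frontier entry
            i = self.rng.randrange(len(self.heap))
            entry = self.heap[i]
            self.heap[i] = self.heap[-1]
            self.heap.pop()
            heapq.heapify(self.heap)
            return entry[2]
        # exploitation: best-scoring link first
        return heapq.heappop(self.heap)[2]
```

With eps=0.0 the frontier degenerates to pure best-first crawling; raising eps trades some short-term harvest rate for the chance to discover link patterns the current model scores poorly.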
In particular, we plan to use form clusters obtained by the online-database clustering technique described in [5] as the training set for the classifier.

Acknowledgments. This work is partially supported by the National Science Foundation (under grants IIS-0513692, CNS-0524096, IIS-0534628) and a University of Utah Seed Grant.

7. REFERENCES

[1] C. C. Aggarwal, F. Al-Garawi, and P. S. Yu. Intelligent crawling on the world wide web with arbitrary predicates. In Proceedings of WWW, pages 96-105, 2001.
[2] L. Barbosa and J. Freire. Siphoning Hidden-Web Data through Keyword-Based Interfaces. In Proceedings of SBBD, pages 309-321, 2004.
[3] L. Barbosa and J. Freire. Searching for Hidden-Web Databases. In Proceedings of WebDB, pages 1-6, 2005.
[4] L. Barbosa and J. Freire. Combining classifiers to identify online databases. In Proceedings of WWW, 2007.
[5] L. Barbosa and J. Freire. Organizing hidden-web databases by clustering visible web documents. In Proceedings of ICDE, 2007. To appear.
[6] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. Computer Networks, 30(1-7):469-477, 1998.
[7] Brightplanet's searchable databases directory. http://www.completeplanet.com.
[8] S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online relevance feedback. In Proceedings of WWW, pages 148-159, 2002.
[9] S. Chakrabarti, M. van den Berg, and B. Dom. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Computer Networks, 31(11-16):1623-1640, 1999.
[10] K. C.-C. Chang, B. He, and Z. Zhang. Toward Large-Scale Integration: Building a MetaQuerier over Databases on the Web. In Proceedings of CIDR, pages 44-55, 2005.
[11] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused Crawling Using Context Graphs. In Proceedings of VLDB, pages 527-534, 2000.
[12] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
[13] M. Galperin. The molecular biology database collection: 2005 update. Nucleic Acids Res, 33, 2005.
[14] Google Base. http://base.google.com/.
[15] L. Gravano, H. Garcia-Molina, and A. Tomasic. GlOSS: Text-source discovery over the internet. ACM TODS, 24(2), 1999.
[16] B. He and K. C.-C. Chang. Statistical Schema Matching across Web Query Interfaces. In Proceedings of ACM SIGMOD, pages 217-228, 2003.
[17] H. He, W. Meng, C. Yu, and Z. Wu. WISE-Integrator: An automatic integrator of web search interfaces for e-commerce. In Proceedings of VLDB, pages 357-368, 2003.
[18] W. Hsieh, J. Madhavan, and R. Pike. Data management projects at Google. In Proceedings of ACM SIGMOD, pages 725-726, 2006.
[19] H. Liu, E. Milios, and J. Janssen. Probabilistic models for focused web crawling. In Proceedings of WIDM, pages 16-22, 2004.
[20] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[21] S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. In Proceedings of VLDB, pages 129-138, 2001.
[22] J. Rennie and A. McCallum. Using Reinforcement Learning to Spider the Web Efficiently. In Proceedings of ICML, pages 335-343, 1999.
[23] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2002.
[24] S. Sizov, M. Biwer, J. Graupmann, S. Siersdorfer, M. Theobald, G. Weikum, and P. Zimmer. The BINGO! System for Information Portal Generation and Expert Web Search. In Proceedings of CIDR, 2003.
[25] W. Wu, C. Yu, A. Doan, and W. Meng. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web. In Proceedings of ACM SIGMOD, pages 95-106, 2004.
[26] J. Xu and J. Callan. Effective retrieval with distributed collections. In Proceedings of SIGIR, pages 112-120, 1998.
[27] Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of ICML, pages 412-420, 1997.
[28] C. Yu, K.-L. Liu, W. Meng, Z. Wu, and N. Rishe. A methodology to retrieve text documents from multiple databases. TKDE, 14(6):1347-1361, 2002.
[29] Z. Zheng, X. Wu, and R. Srihari. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter, 6(1):80-89, 2004.