An Adaptive Crawler for Locating Hidden-Web Entry Points (WWW 2007), page 6
Integrating the apprentice in our framework is a direction we plan to pursue in future work.

Aggarwal et al. [1] proposed an online-learning strategy to learn features of pages that satisfy a user-defined predicate. They start the search with a generic crawler; as new pages that satisfy the user-defined predicates are encountered, the crawler gradually constructs its focus policy. Their method for identifying relevant documents combines different predictors for content and link structure, and manual tuning is required to determine the contribution of each predictor to the final result. In addition, similar to Chakrabarti et al., their strategy only learns features that give immediate benefit. Another drawback of this approach is its use of a generic crawler at the beginning of its execution. Because a generic crawler may need to visit a very large number of pages in order to obtain a significant sample, the learning costs may be prohibitive for sparse domains. As a point of comparison, consider the behavior of the online crawler for the Movie domain (Section 4): even using a focused crawler, only 94 relevant forms were retrieved in a 100,000-page crawl, and these were not sufficient to derive useful patterns.
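The weighted combination of separate content and link-structure predictors that Aggarwal et al. describe can be sketched roughly as follows. This is a minimal illustration, not their implementation: the feature representation, the stand-in predictors, and the weights are all hypothetical, and the weights must be tuned by hand, which is precisely the drawback noted above.

```python
def link_score(content_features, link_features,
               w_content=0.7, w_link=0.3):
    """Combine a content predictor and a link-structure predictor
    into one relevance score for an unvisited link.

    The weights are hypothetical and manually tuned per domain.
    """
    p_content = content_predictor(content_features)
    p_link = link_predictor(link_features)
    return w_content * p_content + w_link * p_link


# Stand-in predictors for the sketch; in a real crawler these would
# be trained online as pages satisfying the predicate are found.
def content_predictor(features):
    # e.g., fraction of predicate-related terms around the anchor
    return min(1.0, sum(features) / max(len(features), 1))

def link_predictor(features):
    # e.g., fraction of relevant pages among known in-neighbors
    return min(1.0, sum(features) / max(len(features), 1))
```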
A much larger number of pages would have to be crawled by a general crawler to obtain the same 94 forms.

6. CONCLUSION AND FUTURE WORK

We have presented a new adaptive focused crawling strategy for efficiently locating hidden-Web entry points. This strategy effectively balances the exploitation of acquired knowledge with the exploration of links with previously unknown patterns, making it robust and able to correct biases introduced in the learning process. We have shown, through a detailed experimental evaluation, that substantial increases in harvest rates are obtained as crawlers learn from new experiences. Since crawlers that learn from scratch are able to obtain harvest rates that are comparable to, and sometimes higher than, those of manually configured crawlers, this framework can greatly reduce the effort to configure a crawler. In addition, by using the form classifier, ACHE produces high-quality results that are crucial for a number of information integration tasks.

There are several important directions we intend to pursue in future work. As discussed in Section 5, we would like to integrate the apprentice of [8] into the ACHE framework. To accelerate the learning process and better handle very sparse domains, we will investigate the effectiveness and trade-offs involved in using back-crawling during the learning iterations to increase the number of sample paths. Finally, to further reduce the effort of crawler configuration, we are currently exploring strategies to simplify the creation of the domain-specific form classifiers.
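The exploitation/exploration balance mentioned above can be illustrated generically with an epsilon-greedy crawl frontier: most of the time the crawler follows its best-scoring link, but occasionally it picks a random one so that links with unfamiliar patterns still get visited. This is only a sketch of the general idea, not the paper's actual policy, and the class and parameter names are invented for the example.

```python
import heapq
import random

class Frontier:
    """Toy epsilon-greedy crawl frontier.

    With probability 1 - eps, pop the highest-scoring link
    (exploitation); with probability eps, pop a uniformly random
    link (exploration of previously unknown patterns).
    """
    def __init__(self, eps=0.1, seed=None):
        self.eps = eps
        self.heap = []        # entries: (-score, counter, url)
        self.rng = random.Random(seed)
        self.counter = 0      # tie-breaker so urls are never compared

    def push(self, url, score):
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1

    def pop(self):
        if self.rng.random() < self.eps:
            # exploration: remove a uniformly random frontier entry
            i = self.rng.randrange(len(self.heap))
            entry = self.heap[i]
            self.heap[i] = self.heap[-1]
            self.heap.pop()
            heapq.heapify(self.heap)
            return entry[2]
        # exploitation: best-scoring link first
        return heapq.heappop(self.heap)[2]
```

With eps=0.0 the frontier degenerates to pure best-first crawling; raising eps trades some short-term harvest rate for the chance to discover link patterns the current model scores poorly.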
In particular, we plan to use form clusters obtained by the online-database clustering technique described in [5] as the training set for the classifier.

Acknowledgments. This work is partially supported by the National Science Foundation (under grants IIS-0513692, CNS-0524096, IIS-0534628) and a University of Utah Seed Grant.

7. REFERENCES

[1] C. C. Aggarwal, F. Al-Garawi, and P. S. Yu. Intelligent crawling on the world wide web with arbitrary predicates. In Proceedings of WWW, pages 96-105, 2001.
[2] L. Barbosa and J. Freire. Siphoning Hidden-Web Data through Keyword-Based Interfaces. In Proceedings of SBBD, pages 309-321, 2004.
[3] L. Barbosa and J. Freire. Searching for Hidden-Web Databases. In Proceedings of WebDB, pages 1-6, 2005.
[4] L. Barbosa and J. Freire. Combining classifiers to identify online databases. In Proceedings of WWW, 2007.
[5] L. Barbosa and J. Freire. Organizing hidden-web databases by clustering visible web documents. In Proceedings of ICDE, 2007. To appear.
[6] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. Computer Networks, 30(1-7):469-477, 1998.
[7] Brightplanet's searchable databases directory. http://www.completeplanet.com.
[8] S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online relevance feedback. In Proceedings of WWW, pages 148-159, 2002.
[9] S. Chakrabarti, M. van den Berg, and B. Dom. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Computer Networks, 31(11-16):1623-1640, 1999.
[10] K. C.-C. Chang, B. He, and Z. Zhang. Toward Large-Scale Integration: Building a MetaQuerier over Databases on the Web. In Proceedings of CIDR, pages 44-55, 2005.
[11] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused Crawling Using Context Graphs. In Proceedings of VLDB, pages 527-534, 2000.
[12] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
[13] M. Galperin. The molecular biology database collection: 2005 update. Nucleic Acids Res, 33, 2005.
[14] Google Base. http://base.google.com/.
[15] L. Gravano, H. Garcia-Molina, and A. Tomasic. GlOSS: Text-source discovery over the internet. ACM TODS, 24(2), 1999.
[16] B. He and K. C.-C. Chang. Statistical Schema Matching across Web Query Interfaces. In Proceedings of ACM SIGMOD, pages 217-228, 2003.
[17] H. He, W. Meng, C. Yu, and Z. Wu. WISE-Integrator: An automatic integrator of web search interfaces for e-commerce. In Proceedings of VLDB, pages 357-368, 2003.
[18] W. Hsieh, J. Madhavan, and R. Pike. Data management projects at Google. In Proceedings of ACM SIGMOD, pages 725-726, 2006.
[19] H. Liu, E. Milios, and J. Janssen. Probabilistic models for focused web crawling. In Proceedings of WIDM, pages 16-22, 2004.
[20] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[21] S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. In Proceedings of VLDB, pages 129-138, 2001.
[22] J. Rennie and A. McCallum. Using Reinforcement Learning to Spider the Web Efficiently. In Proceedings of ICML, pages 335-343, 1999.
[23] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2002.
[24] S. Sizov, M. Biwer, J. Graupmann, S. Siersdorfer, M. Theobald, G. Weikum, and P. Zimmer. The BINGO! System for Information Portal Generation and Expert Web Search. In Proceedings of CIDR, 2003.
[25] W. Wu, C. Yu, A. Doan, and W. Meng. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web. In Proceedings of ACM SIGMOD, pages 95-106, 2004.
[26] J. Xu and J. Callan. Effective retrieval with distributed collections. In Proceedings of SIGIR, pages 112-120, 1998.
[27] Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of ICML, pages 412-420, 1997.
[28] C. Yu, K.-L. Liu, W. Meng, Z. Wu, and N. Rishe. A methodology to retrieve text documents from multiple databases. TKDE, 14(6):1347-1361, 2002.
[29] Z. Zheng, X. Wu, and R. Srihari. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter, 6(1):80-89, 2004.