
Focused Crawling System Based on Text Experience Models

Posted on: 2007-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Liu
GTID: 2178360182496196
Subject: Computer software and theory
Abstract/Summary:
The Web is growing rapidly; by 1996, traditional information retrieval systems were already unable to cover the entire Web. Focused crawling has therefore become an increasingly important technology that attracts more and more researchers. A focused crawling system automatically searches the Web for information on specific topics, and with its help people can find information more quickly. Because the technology is new, several aspects still need to be studied in depth, and focused crawling is an active research topic today.

Early focused crawling algorithms followed two main approaches: one evaluates pages based on the link topology of the World Wide Web; the other evaluates pages based on their content. Both select, from all downloaded pages, the URLs with the highest evaluation or importance value. Many effective focused crawling algorithms evaluate page importance with these methods. Besides page importance methods, there are "intelligent agent" algorithms, which employ multiple crawling agents with different crawling strategies. An agent with a good crawling strategy lives longer than a poor one and has a high probability of producing a new individual, while poor agents eventually die. These algorithms emphasize different viewpoints, but they all estimate a URL's priority from the importance of the URL's parent page, so all URLs on the same page receive the same priority. This harms the crawling system: the URLs on a topic page actually differ in priority, yet the crawler fetches them without distinction.

In 2001, Chakrabarti proposed a new algorithm, "Accelerated Focused Crawling through Online Relevance Feedback". It estimates each URL's priority from the URL's own information, so every URL gets its own priority estimate even when several URLs belong to the same page. Chakrabarti's algorithm takes focused crawling research in a new direction.
It implies that focused crawling research should consider the URLs' own information, not only page content.

This thesis builds on Chakrabarti's idea and proposes a new algorithm, "Focused Crawling System Based on Text Experience Models". The algorithm estimates URL priority by building experience models and selects the URLs with the highest priority values to crawl. Earlier algorithms usually do not use the information the crawling system accumulates during the crawl. As a result, the crawler may repeat the same mistake many times without correcting its behavior, so its harvest ratio of topic pages is relatively low. To overcome this shortcoming, the thesis builds text experience models that guide the crawler toward pages with a high probability of being on topic. The crawler uses its crawling experience to correct wrong crawl steps, and can therefore collect more topic pages than other crawling systems.

The thesis mainly discusses how to build text experience models and how a topic-focused crawling system uses them to fetch Web pages. First, text experience models describe the natural characteristics of experience data and are built on the basis of those natural properties. Second, the system can build text experience models while it fetches Web pages, so working and learning can run in parallel.

To build text experience models, the system must extract experience meta-data from Web pages that have already been downloaded. The meta-data consist of the topic probability of the URL's parent pages and the text near the URL. Clustering analysis then groups the experience data into several clusters, and the Naive Bayes algorithm is used to build the text experience models. Given the text near a URL as input, the system estimates the topic probability of the page that the URL links to.
In this way, text experience models steer the crawling system toward the topic. The focused crawling system accumulates crawling experience to teach itself, so it can learn continuously.

In the experiment, we first run the crawling system with the BF algorithm to accumulate crawling experience. Once enough meta-data has been collected, the BF crawl stops. We then save the database as the starting point for the next crawls and, on the basis of this database, run the focused crawling system with different algorithms. We compare the BF algorithm and the Online Relevance Feedback algorithm with our algorithm based on text experience models. The experiment shows that the focused crawling algorithm based on text experience models achieves better results than the others. The experimental data also show that our algorithm is more adaptable and more robust than the other algorithms, and that it produces stable results even when the topic is a narrow knowledge area.

The focused crawling algorithm based on text experience models can still be improved. I believe that further research on the experience models algorithm will bring more improvements to focused crawling technology.
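
The core idea of the abstract (learn from the text near already-crawled links, then rank unvisited URLs by the estimated topic probability of their targets) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The class names `LinkContextModel` and `Frontier`, the whitespace tokenizer, the binary on-topic label, and the Laplace smoothing are all assumptions made for illustration, and the clustering step described in the abstract is omitted.

```python
import heapq
import math
from collections import Counter


class LinkContextModel:
    """Hypothetical two-class multinomial Naive Bayes over the text near a link."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.example_counts = {True: 0, False: 0}
        self.vocab = set()

    def learn(self, context_text, on_topic):
        # One "experience" record: text near a link plus whether the
        # crawl outcome was on topic (assumed to be a yes/no label here).
        words = context_text.lower().split()
        self.word_counts[on_topic].update(words)
        self.example_counts[on_topic] += 1
        self.vocab.update(words)

    def topic_probability(self, context_text):
        total = sum(self.example_counts.values())
        if total == 0:
            return 0.5  # no experience yet: stay neutral
        words = context_text.lower().split()
        log_score = {}
        for label in (True, False):
            # Laplace-smoothed class prior and word likelihoods.
            score = math.log((self.example_counts[label] + 1) / (total + 2))
            denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            log_score[label] = score
        # Normalize in log space to avoid underflow on long contexts.
        top = max(log_score.values())
        weight = {k: math.exp(v - top) for k, v in log_score.items()}
        return weight[True] / (weight[True] + weight[False])


class Frontier:
    """Priority queue of unvisited URLs, highest estimated probability first."""

    def __init__(self, model):
        self.model = model
        self.heap = []
        self.seen = set()

    def add(self, url, context_text):
        if url in self.seen:
            return
        self.seen.add(url)
        priority = self.model.topic_probability(context_text)
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self.heap, (-priority, url))

    def pop(self):
        negated, url = heapq.heappop(self.heap)
        return url, -negated


# Made-up training records and URLs, for illustration only.
model = LinkContextModel()
model.learn("focused crawler topic relevance", True)
model.learn("web crawler priority queue", True)
model.learn("celebrity gossip photos", False)
model.learn("cheap flights hotel deals", False)

frontier = Frontier(model)
frontier.add("http://example.org/a", "hotel deals and photos")
frontier.add("http://example.org/b", "crawler relevance feedback")
next_url, estimate = frontier.pop()
```

In this sketch two links found on the same page get different priorities because each is scored on its own surrounding text. Priorities are fixed when a URL is queued; a crawler that keeps learning while it works, as the abstract describes, would also need to re-score queued URLs as the model changes.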
Keywords/Search Tags: Experience