Research On Focused Crawler Technology

Posted on: 2009-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: X G Ni
Full Text: PDF
GTID: 2178360272456761
Subject: Computer software and theory
Abstract/Summary:
With the explosive growth of online information resources, the Web has become the largest information repository to date. Confronted with this huge, heterogeneous, and semi-structured repository, Web users often have to spend a great deal of time and effort to find the information they need. This problem is generally called "information overload on the Web". To address it, topic-driven crawling has been proposed in the Web information retrieval community in recent years. Such a system uses an intelligent focused crawler to collect online documents that are highly relevant to predefined target topics, and analyzes the collected information with machine learning and information retrieval techniques, giving users an efficient and convenient way to retrieve information. The underlying theory and technology include machine learning, information retrieval, statistics, and new Web technologies. The approach can be applied in various areas, including Web-based industry analysis and automatic digital libraries. This paper introduces the theory and architecture of search engines and focused crawlers, and analyzes in depth topic definition, Web hyperlink analysis and content analysis algorithms, and the crawling strategy of the focused crawler. HITS is good at discovering topical Web communities, but it often suffers from the "topic drift" problem. To avoid falling into the local optimum of Best First Search, this paper proposes a new topic crawling strategy that combines hyperlink rank and content topic relevance to calculate the total rank of target pages. It uses the HITS algorithm to compute the hyperlink rank of URLs and to discover and fetch Web communities and authority pages, and it uses a topic relevance decision algorithm based on the vector space model (VSM) to quantify the relevance of crawled pages accurately. Because it avoids "topic drift", the strategy increases the harvest rate of the crawler.
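The combined ranking idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how the hyperlink rank and the content relevance are merged, so the linear weighting and the `alpha` parameter are assumptions, as are the function names `hits`, `cosine_relevance`, and `combined_rank` and the use of raw term counts as VSM weights.

```python
import math
from collections import Counter


def hits(graph, iterations=50):
    """Iterative HITS. `graph` maps each page to the list of pages it links to."""
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to this page.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of the updated authority scores of pages this page links to.
        new_hub = {n: sum(new_auth[dst] for dst in graph.get(n, [])) for n in nodes}
        # L2-normalise both score vectors so the values stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub


def cosine_relevance(page_terms, topic_terms):
    """VSM relevance: cosine similarity of raw term-count vectors (assumed weighting)."""
    p, t = Counter(page_terms), Counter(topic_terms)
    dot = sum(p[w] * t[w] for w in p.keys() & t.keys())
    norm = math.sqrt(sum(c * c for c in p.values())) * math.sqrt(sum(c * c for c in t.values()))
    return dot / norm if norm else 0.0


def combined_rank(authority, relevance, alpha=0.5):
    """Assumed linear mix of link-based and content-based scores."""
    return alpha * authority + (1 - alpha) * relevance
```

In this sketch, a page with a high authority score but low topical relevance is ranked lower than it would be by link analysis alone, which is one plausible way to counter topic drift.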
Based on the target topic definition of the focused crawler, a text categorization algorithm is used to build a topic feature lexicon and to extract topic features for computing the relevance of Web pages. Finally, a prototype focused crawler system based on the synthesized crawling strategy is designed. It improves on the architecture of existing focused crawlers and implements intelligent collection of topical Web resources. The harvest rate obtained in experiments demonstrates the validity of the synthesized crawling strategy.
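Harvest rate is commonly defined as the fraction of crawled pages that are relevant to the target topic. The following Python sketch shows that calculation under stated assumptions: the function name `harvest_rate` and the relevance `threshold` of 0.5 are hypothetical, since the abstract does not give the cut-off used in the experiments.

```python
def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance score reaches the threshold.

    `relevance_scores` holds one score per crawled page; `threshold` is an
    assumed cut-off, not a value taken from the thesis.
    """
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for score in relevance_scores if score >= threshold)
    return relevant / len(relevance_scores)
```

For example, a crawl whose pages score 0.9, 0.2, 0.7, and 0.1 has two relevant pages out of four at the assumed threshold.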
Keywords/Search Tags: Vertical Search Engine, Focused Crawler, Hyperlink Analysis, Content Analysis, Hypertext Classification