Research On Focused Crawler Technology

Posted on:2009-12-19

Degree:Master

Type:Thesis

Country:China

Candidate:X G Ni

GTID:2178360272456761

Subject:Computer software and theory

Abstract/Summary:

With the explosive growth of online information resources, the Web has become the largest information repository to date. Confronted with this huge, heterogeneous, and semi-structured repository, Web users often have to spend a lot of time and effort to find the information they need. This problem is generally called "information overload on the Web". To address it, topic-driven crawling has been proposed in the Web information retrieval community in recent years. Such a system uses an intelligent focused crawler to collect online documents that are highly relevant to predefined target topics, and analyzes the collected information with machine learning and information retrieval techniques, giving users an efficient and convenient way to retrieve information. The underlying theory and technology include machine learning, information retrieval, statistics, and new Web technologies. The approach can be applied in various areas, including Web-based industry analysis and automatic digital libraries. This paper introduces the theory and architecture of search engines and focused crawlers, with emphasis on topic definition, Web hyperlink analysis and content analysis algorithms, and the crawling strategy of the focused crawler. HITS is good at discovering topical Web communities, but it often suffers from the "topic drift" problem. To avoid falling into the local optimum of Best First Search, this paper proposes a new topic crawling strategy that combines hyperlink rank and content topic relevance to calculate the total rank of target pages. It uses the HITS algorithm to compute the hyperlink rank of URLs and to discover and fetch Web communities and authority pages, and it uses a topic relevance decision algorithm based on the vector space model (VSM) to quantify the relevance of crawled pages accurately. Because it avoids "topic drift", the strategy increases the harvest rate of the crawler.
Based on the target topic definition of the focused crawler, a text categorization algorithm is used to build a topic feature lexicon, and topic features are extracted to compute the relevance of Web pages. Finally, a focused crawler prototype system based on the synthesized crawling strategy is designed. It improves on the architecture of existing focused crawlers and implements intelligent collection of topical Web resources. The harvest rate measured in experiments demonstrates the validity of the synthesized crawling strategy.
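
The combined ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the hyperlink rank and the content relevance are merged, so the linear weight `alpha`, the raw term-frequency vectors, and the function names `hits`, `cosine_relevance`, and `combined_rank` are all assumptions made for illustration.

```python
import math
from collections import Counter


def hits(graph, iterations=50):
    """Return (hub, authority) scores for a dict {page: [pages it links to]}."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub of a page: sum of the authority scores of pages it links to.
        hub = {p: sum(auth[d] for d in graph.get(p, [])) for p in pages}
        # Normalise both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


def cosine_relevance(text, topic_terms):
    """VSM cosine similarity between a page and the topic (raw term counts)."""
    doc = Counter(text.lower().split())
    topic = Counter(t.lower() for t in topic_terms)
    dot = sum(doc[t] * w for t, w in topic.items())
    norm = (math.sqrt(sum(v * v for v in doc.values()))
            * math.sqrt(sum(v * v for v in topic.values())))
    return dot / norm if norm else 0.0


def combined_rank(graph, texts, topic_terms, alpha=0.5):
    """Assumed linear mix: alpha * authority + (1 - alpha) * relevance."""
    _, auth = hits(graph)
    return {
        page: alpha * auth[page]
        + (1 - alpha) * cosine_relevance(texts.get(page, ""), topic_terms)
        for page in auth
    }


# Hypothetical three-page link graph and page texts.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
texts = {
    "a": "focused crawler topic",
    "b": "cooking recipes",
    "c": "focused crawler search engine",
}
ranks = combined_rank(graph, texts, ["focused", "crawler"])
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page `b` receives an in-link but contains no topic terms, so its content score pulls its total rank below that of page `a`, which has no in-links but is on topic. This is the kind of correction the content term is meant to provide against link-only "topic drift".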

Keywords/Search Tags:

Vertical Search Engine, Focused Crawler, Hyperlink Analysis, Content Analysis, Hypertext Classification

Related items

1	The Optimization And Achieve For Focused Crawling Algorithm Based On The Website Content Framework
2	Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine
3	Research On An Algorithm Of Focused Crawler In Vertical Search Engine
4	The Research On Focused Crawling Algorithm In Vertical Search Engine
5	Research And Design On Focused Crawler Of Search Engine
6	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
7	Research Of Main Technologies Of Vertical Search Engine
8	Research And Implementation Of Focused Crawler's Search Strategy In The Vertical Search Engine
9	Research On Focused Crawler Technology Of Vertical Search Engine
10	Research On Topical Crawler Combining Web Page Content And Hyperlink