The Research On Focused Crawling Algorithm In Vertical Search Engine

Posted on: 2010-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: K Q Chen
GTID: 2178330332481938
Subject: Computer application technology
Abstract/Summary:
As the volume of information on the Web keeps expanding, people depend increasingly on vertical search engines. This thesis examines a key component of the topic-specific search engine, the topical (focused) crawler. A focused crawler can quickly crawl the portions of the World Wide Web that are relevant to a particular topic without having to explore every page. Focused crawling is increasingly applied in topic-specific search engines, site-structure analysis, and related areas.

We introduce the structure of a rule-based crawler that downloads "important" web pages first, ranking each page by the sum of its URL importance value and its text relevance value, and we present the algorithm we designed. The crawler's search algorithm draws on both the Web hyperlink structure and page content. To judge the relevance of page content to the topic, we use the vector space model, which is widely applied in text classification; the term weighting also takes anchor text and the location of text within the page into account. To judge the relevance of a URL to the topic, we develop a new method that statistically captures linkage relationships among the classes (topics), extracts rules from them, and uses these rules to guide the focused crawler. The method also improves on the PageRank algorithm.

Our work starts from the focused-crawling approach designed by Soumen Chakrabarti, Martin van den Berg and Byron Dom, referred to here as the baseline crawler. Building on it, we developed a rule-based crawler that uses simple rules derived from interclass (topic) linkage patterns to decide its next move.

Initial performance results show that using linkage statistics among topics in this way improves the baseline focused crawler's harvest rate. The rule-based crawler also extends the baseline crawler by supporting tunneling.
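The ranking idea in the abstract (priority = URL importance value + text relevance value) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the raw term-frequency weighting, the `anchor_weight` boost, and the example rule table are all assumptions made for illustration. The abstract does not give the actual weighting scheme, the rule-extraction procedure, or the PageRank modification, so none of those are reproduced here.

```python
import heapq
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector of a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def text_relevance(page_text, anchor_text, topic_vec, anchor_weight=2.0):
    """Vector-space relevance of a page to the topic.

    Anchor-text terms get extra weight (anchor_weight is a hypothetical
    parameter; the thesis's real weighting is not stated in the abstract).
    """
    vec = tf_vector(page_text)
    for term, count in tf_vector(anchor_text).items():
        vec[term] += anchor_weight * count
    return cosine(vec, topic_vec)


def link_importance(source_class, target_topic, rules):
    """Look up a rule of the form (source class, topic) -> link likelihood.

    `rules` stands in for statistics on how often pages of one class link
    to pages of another; unseen pairs score 0.0.
    """
    return rules.get((source_class, target_topic), 0.0)


class Frontier:
    """Crawl frontier that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, url_importance, relevance):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-(url_importance + relevance), url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In this sketch, a page whose own text is off-topic can still be queued with a nonzero score if its class often links to on-topic pages, which is one simple way to picture how linkage rules could support tunneling.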
Keywords/Search Tags: vertical search engine, topical/focused crawler, interclass linkage, baseline focused crawler, tunneling