
Research On A Method Of Focused Crawler Based On Page Partition

Posted on: 2012-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: M L Xing
Full Text: PDF
GTID: 2218330362954437
Subject: Computer software and theory
Abstract/Summary:
As the volume of information on the Internet grows, the results of general-purpose search engines can no longer meet users' needs, especially for users from different fields or backgrounds who want to search for information in specialized domains. Vertical search engines, which are more specialized and personalized, have emerged in response. A key component of a vertical search engine is the focused crawler, which downloads the topic-relevant portions of the Web; the quality of what it downloads directly affects the results of the vertical search engine. That quality depends mainly on two things: how the relevance between a downloaded page and the target topic is predicted, and how the visit priority of candidate URLs in the crawl frontier is decided. In addition, the structural characteristics of Web pages give rise to the "tunnel" phenomenon in focused crawling, which strongly affects the coverage and accuracy of the crawler.

Drawing on the advantages of page blocks and the best-first search strategy, and in order to help the crawler cross tunnels, this thesis proposes a new method based on page partition: ① Applying the idea of a classifier to the focused crawler, a center-vector (centroid) classifier is used to determine the topic of a page or page block and to measure similarity. The main advantages of this classifier are that it describes the user's topics of interest well and that it runs fast, which improves crawling speed. ② Because URLs differ in nature, the URLs in a page block are classified as special URLs and common URLs when predicting their visit priorities. This improves prediction accuracy and so overcomes the problem of relevant pages being neglected because of inaccurate prediction. ③ Working at the level of page blocks, the transition probability between classes helps the crawler cross "black tunnels", and page partition helps it cross "grey tunnels".

Finally, using three target topics from DMOZ, comparison experiments on Harvest Ratio are carried out among three crawlers, implemented respectively with the proposed method, the text-content method, and the classifier-guided method. The results show that the proposed method achieves a better Harvest Ratio than the other two and better meets the needs of a focused crawler.
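
The abstract gives no implementation details, so the following is only a minimal sketch of the general ideas it names: a centroid classifier scoring page blocks, a best-first frontier ordered by that score, and the Harvest Ratio metric. All function and class names here are hypothetical, the term-frequency vectors and cosine similarity are assumed choices, and the special/common URL distinction and the class transition probabilities are not modelled because the abstract does not define them.

```python
import heapq
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(example_docs):
    """Topic centroid: the mean term-frequency vector of example documents."""
    total = Counter()
    for doc in example_docs:
        total.update(tf_vector(doc))
    return {term: count / len(example_docs) for term, count in total.items()}


class Frontier:
    """Best-first crawl frontier: the highest-scoring URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that preserves insertion order

    def push(self, url, score):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def enqueue_blocks(frontier, blocks, topic):
    """Score each page block against the topic centroid and enqueue its URLs.

    `blocks` is a list of (block_text, urls) pairs; how a page is split
    into blocks is outside this sketch.
    """
    for block_text, urls in blocks:
        score = cosine(tf_vector(block_text), topic)
        for url in urls:
            frontier.push(url, score)


def harvest_ratio(relevant_pages, downloaded_pages):
    """Fraction of downloaded pages that are relevant to the topic."""
    return relevant_pages / downloaded_pages if downloaded_pages else 0.0
```

Scoring at block level rather than page level means that links in an on-topic block of an otherwise off-topic page can still receive a high priority, which is one plausible reading of how page partition helps with partially relevant pages.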
Keywords/Search Tags:Focus crawler, Tunnel, Classifier, Page partition, Transition probability between classes