
Research On Web Crawling Strategies

Posted on: 2011-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Weng
GTID: 2178330332960046
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of information on the Web, people cannot directly and accurately locate the resources they are interested in, so they depend more and more on search engines. However, the Web is so large that no Web crawler can obtain all of its pages. Since a crawler cannot reach every page, it needs to crawl as many important pages as possible within a limited period of time. A Web crawling strategy determines the order in which pages are visited, so that the crawler gives priority to important pages. First, this thesis analyzes in depth the key technologies for building efficient Web crawlers, including the URL scheduler, which determines the crawl order and is the main functional component for implementing a crawling strategy. Then, after analyzing a variety of criteria for assessing Web page importance, the thesis selects the PageRank-based link-analysis criterion as the basis for evaluating page importance. PageRank makes full use of the hyperlink information between Web pages, taking into account both the number and the quality of a page's backlinks, and it objectively defines the relative importance of pages across the entire Web. Finally, the thesis finds that a good Web crawling strategy should give priority to important pages while also meeting the requirements of crawling speed, courtesy, and balance across the crawled sites. Existing Web crawlers, however, cannot meet all of these requirements at the same time. Therefore, this thesis designs a comprehensive-weight Web crawling strategy that uses two levels of priority scheduling.
Site-level scheduling meets the courtesy and balance requirements, while Web-page-level scheduling reaches higher-quality pages by introducing a historical-information mechanism. The thesis also designs and develops a Web crawler to collect the required experimental data sets and, on that basis, compares different Web crawling strategies by means of a virtual crawl. Because the Web changes dynamically, a virtual crawl is the only way to ensure that different crawling strategies are compared under the same conditions. Experiments show that the comprehensive-weight Web crawling strategy obtains higher-quality pages while still meeting the requirements of crawling speed, courtesy, and balance.
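The abstract does not give the details of the two scheduling levels, so the following is only a minimal sketch of the general idea, not the thesis's actual algorithm. It assumes that each URL already carries an importance score (for example, an estimate derived from PageRank), that the site level is a plain round-robin over hosts, and that the page level is a per-host priority queue. The class name `TwoLevelScheduler`, its methods, and the example URLs and scores are all hypothetical.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class TwoLevelScheduler:
    """Illustrative scheduler: round-robin across sites, priority within a site."""

    def __init__(self):
        self._pages = {}       # host -> heap of (-score, url); highest score pops first
        self._hosts = deque()  # hosts that currently have pending URLs

    def add(self, url, score):
        """Queue a URL with an assumed importance score (higher is better)."""
        host = urlsplit(url).netloc
        if host not in self._pages:
            self._pages[host] = []
            self._hosts.append(host)
        heapq.heappush(self._pages[host], (-score, url))

    def next_url(self):
        """Return the best pending URL of the next site in turn, or None if empty."""
        if not self._hosts:
            return None
        # Site level: take the host at the front, so no single site is
        # visited twice in a row while other sites still have pending URLs.
        host = self._hosts.popleft()
        # Page level: within that host, take the highest-scored URL.
        _, url = heapq.heappop(self._pages[host])
        if self._pages[host]:
            self._hosts.append(host)  # host goes to the back of the rotation
        else:
            del self._pages[host]
        return url


# Hypothetical frontier: three pages on one site, one low-scored page on another.
scheduler = TwoLevelScheduler()
for url, score in [
    ("http://a.example/1", 0.9),
    ("http://a.example/2", 0.8),
    ("http://b.example/1", 0.1),
    ("http://a.example/3", 0.7),
]:
    scheduler.add(url, score)

order = []
while True:
    url = scheduler.next_url()
    if url is None:
        break
    order.append(url)
print(order)
```

In this sketch the low-scored page on `b.example` is fetched second, ahead of higher-scored pages on `a.example`, which shows the trade-off between balance across sites and strict importance ordering. A real crawler would also need per-host delays for courtesy and some way to update scores as new links are discovered; both are omitted here.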
Keywords/Search Tags:Web Crawler, Web Page Importance, Web Crawling Strategies, PageRank