Research On Web Crawling Strategies

Posted on:2011-12-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y Q Weng

GTID:2178330332960046

Subject:Computer application technology

Abstract/Summary:

With the explosive growth of information on the Web, people cannot directly and accurately locate the resources they are interested in, so they depend more and more on search engines. However, the Web is so large that no Web crawler can obtain all of its pages. Since a crawler cannot reach every page, it needs to crawl as many important pages as possible within a limited period of time. A Web crawling strategy determines the order in which pages are accessed, so that the crawler gives priority to important pages.

First, this thesis analyzes in depth the key technologies for building efficient Web crawlers, including the URL scheduler, which determines the crawl order and is the main functional component for realizing a Web crawling strategy. Then, after analyzing a variety of criteria for assessing Web page importance, the thesis selects the PageRank-based link analysis criterion as the basis for evaluating page importance. PageRank makes full use of the hyperlink information between Web pages: it takes into account both the number and the quality of the links pointing to a page (its backlinks), and it objectively defines the relative importance of pages across the entire Web. Finally, a good Web crawling strategy should give priority to important pages while also meeting the requirements of crawling speed, courtesy, and balance across the crawled sites. Existing Web crawlers, however, do not meet all of these requirements well at the same time. This thesis therefore designs a comprehensive weight Web crawling strategy that uses a two-level priority scheduling policy. Site-level scheduling meets the courtesy and balance requirements, while page-level scheduling reaches higher-quality pages by introducing a historical information mechanism.

The thesis also designs and develops a Web crawler to obtain the required experimental data sets and, on that basis, uses virtual crawling to compare different Web crawling strategies. Because the Web changes dynamically, virtual crawling is the only way to ensure that different strategies are compared under the same conditions. Experiments show that the comprehensive weight Web crawling strategy obtains pages of better quality while meeting the requirements of crawling speed, courtesy, and balance.
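The abstract does not give the thesis's formulas or data structures, so the following is only a minimal sketch of the two ideas it names: scoring pages with PageRank, and ordering the crawl at two levels (site, then page). The function names `pagerank` and `crawl_order`, the damping factor 0.85, the toy link graph, and the round-robin rotation over hosts are all assumptions made for illustration. The historical information mechanism and real politeness delays are not modeled.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page shares its rank equally among its out-links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def crawl_order(scores):
    """Hypothetical two-level ordering: rotate over hosts (site level),
    and take the best-scored remaining URL of each host (page level)."""
    queues = defaultdict(list)
    for url, score in scores.items():
        queues[urlparse(url).netloc].append((score, url))
    for queue in queues.values():
        queue.sort(reverse=True)  # highest score first within a host
    # Visit hosts in order of their best page, then keep rotating so that
    # no single site is hit repeatedly in a row (a stand-in for courtesy).
    hosts = deque(sorted(queues, key=lambda h: queues[h][0], reverse=True))
    order = []
    while hosts:
        host = hosts.popleft()
        order.append(queues[host].pop(0)[1])
        if queues[host]:
            hosts.append(host)
    return order


# Toy link graph (made-up URLs): every other page links back to a.example's root.
links = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://a.example/", "http://b.example/1"],
    "http://b.example/1": ["http://a.example/"],
}
rank = pagerank(links)
order = crawl_order(rank)
print(order)
```

In this sketch the page with the most backlinks receives the highest score and is fetched first, while the rotation over hosts keeps consecutive requests from landing on the same site. A real scheduler would replace the rotation with per-site delay timers and would update scores as new links are discovered.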

Keywords/Search Tags:

Web Crawler, Web Page Importance, Web Crawling Strategies, PageRank

Related items

1	Vertical Search Engine For Crawling The Web Page Design And Implementation
2	Web Page Importance Ranking With Priori Knowledge
3	The Research On Key Techniques For Page Segmentation Based Forum Crawler
4	Design And Implementation Of Web Crawler For Given Page
5	Design And Implementation Of Focused Crawler
6	The Static Ranking Algorithm Of Web Pages Based On The Importance Propagation Model
7	Research On Topical Crawler Combining Web Page Content And Hyperlink
8	Distributed Web Crawler System
9	Research And Implementation Of A Combined Focused Crawler Based On Protocol-Driven And Event-Driven Crawling Techniques
10	Design And Implementation Of The Theme Crawler For Procurement Clues In The Automotive Field