
Research and Implementation of Web Crawler Performance Improvement and Function Expansion

Posted on: 2013-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: M Jin
Full Text: PDF
GTID: 2248330395459452
Subject: Software engineering
Abstract/Summary:
With the development of Internet technologies, websites have become the largest carriers of information. How to obtain and make the best use of such resources is increasingly important, but also an inevitable challenge. The web crawler was born of this need: a program or script that automatically fetches web content while following certain rules.

This paper first reviews the history and major applications of web crawlers. Through a survey of currently popular web crawlers, we found that the majority of existing crawlers serve search engines by supplying data resources for topic-specific queries from Internet users. Because traditional crawlers are built chiefly to serve search engines, their adaptability and multi-functionality have been progressively weakened, even though the underlying crawling frameworks offer good extensibility. This paper then discusses several crawler performance indicators and proposes optimization strategies for small and medium-scale web crawlers, aiming both to improve running performance and to expand crawler functionality.

Regarding performance improvement, this paper designs distinct optimization strategies for different function modules. First, it compresses page contents with Gzip to reduce transmission time. Second, it issues asynchronous requests to Internet resources to raise bandwidth and CPU utilization. Third, it crawls in breadth-first order and performs large-scale duplicate-URL detection with a Bloom filter. Fourth, it uses carefully designed regular expressions to extract links from web pages. Fifth, it normalizes crawled URLs so that malformed or inconsistent URLs do not mislead the crawler. Sixth, it uses an optimized thread pool to manage multiple threads efficiently.
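The breadth-first crawl order and Bloom-filter deduplication named above can be sketched as follows. The thesis implements its crawler in C# on .NET; this is only an illustrative Python sketch, and the filter parameters (2^20 bits, 5 hash functions) and the seed URL are assumptions, not values from the thesis.

```python
import hashlib
from collections import deque

class BloomFilter:
    """Minimal Bloom filter for duplicate-URL detection (illustrative sketch)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _hashes(self, item):
        # Derive k hash positions from one MD5 digest via double hashing.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item):
        # May report rare false positives, never false negatives.
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

# Breadth-first frontier: a FIFO queue, with the filter giving O(1) membership tests.
seen = BloomFilter()
frontier = deque(["http://example.com/"])
seen.add("http://example.com/")
# In the real crawler, each dequeued URL would be fetched and its unseen
# out-links added: if link not in seen: seen.add(link); frontier.append(link)
```

A Bloom filter trades a small, tunable false-positive rate for constant memory, which is what makes "massive" URL deduplication feasible on a small crawler.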
Regarding the function expansion module, this paper describes three features that distinguish this crawler from existing ones. First, it analyzes the performance of crawled static web pages and offers optimization suggestions for the target websites. Second, it acts as an automated testing tool that runs test cases on crawled pages. Third, it extracts customized data from web resources according to user-specified formats.

To verify the strategies above, the .NET platform was chosen as a suitable base for building a lightweight web crawler. The crawler is written in C# and developed in the Visual Studio 2008 IDE. Experimental results show that the crawler runs in console mode with high configurability driven by file inputs.
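The customized, format-driven data extraction described above can be sketched roughly as below. Again, the thesis's actual implementation is in C# and reads its format specifications from input files; the `FORMAT_SPEC` mapping, the field names, and the patterns here are all hypothetical, chosen only to show the idea of extracting named fields by user-supplied rules.

```python
import re

# Hypothetical format specification: field name -> regex with one capture group.
# In the thesis such specs would come from configuration files; inlined here.
FORMAT_SPEC = {
    "title": r"<title>(.*?)</title>",
    "links": r'href="(https?://[^"]+)"',
}

def extract(html, spec=FORMAT_SPEC):
    """Pull customized fields out of a crawled page according to the spec."""
    return {
        field: re.findall(pattern, html, re.IGNORECASE | re.DOTALL)
        for field, pattern in spec.items()
    }

page = '<html><title>Demo</title><a href="http://example.com/a">link</a></html>'
fields = extract(page)
# fields == {'title': ['Demo'], 'links': ['http://example.com/a']}
```

Driving extraction from a declarative spec, rather than hard-coded parsing, is what lets one crawler serve many target formats without recompilation.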
Keywords/Search Tags: Web crawler, Performance improvement, Function expansion