
Research and Implementation of Web Crawler Performance Improvement and Function Expansion

Posted on: 2013-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: M Jin
Full Text: PDF
GTID: 2248330395459452
Subject: Software engineering
Abstract/Summary:
With the development of Internet technologies, websites have become the largest carriers of information. How to obtain and make the best use of such resources is increasingly important, but also an inevitable challenge. The web crawler was born of this need: a program or script that automatically fetches web content while following certain rules.

This paper first reviews the history and major applications of web crawlers. Through a survey of currently popular web crawlers, we found that the majority of existing crawlers serve search engines by supplying data resources for topic-specific queries from Internet users. Because traditional crawlers are built chiefly to serve search engines, their adaptability and multi-functionality have been progressively weakened, even though the underlying crawling frameworks offer good extensibility. This paper then discusses several crawler performance indicators and proposes optimization strategies for small and medium-scale web crawlers, aiming both to improve running performance and to expand crawler functionality.

Regarding performance improvement, this paper designs distinct optimization strategies for different function modules. First, it compresses page contents with Gzip to reduce transmission time. Second, it issues asynchronous requests to Internet resources to raise bandwidth and CPU utilization. Third, it crawls in breadth-first order and performs large-scale duplicate-URL detection with a Bloom filter. Fourth, it uses carefully designed regular expressions to extract links from web pages. Fifth, it normalizes crawled URLs so that malformed or inconsistent URLs do not mislead the crawler. Sixth, it uses an optimized thread pool to manage multiple threads efficiently.
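The breadth-first crawl order and Bloom-filter deduplication named above can be sketched as follows. The thesis implements its crawler in C# on .NET; this is only an illustrative Python sketch, and the filter parameters (2^20 bits, 5 hash functions) and the seed URL are assumptions, not values from the thesis.

```python
import hashlib
from collections import deque

class BloomFilter:
    """Minimal Bloom filter for duplicate-URL detection (illustrative sketch)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _hashes(self, item):
        # Derive k hash positions from one MD5 digest via double hashing.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item):
        # May report rare false positives, never false negatives.
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

# Breadth-first frontier: a FIFO queue, with the filter giving O(1) membership tests.
seen = BloomFilter()
frontier = deque(["http://example.com/"])
seen.add("http://example.com/")
# In the real crawler, each dequeued URL would be fetched and its unseen
# out-links added: if link not in seen: seen.add(link); frontier.append(link)
```

A Bloom filter trades a small, tunable false-positive rate for constant memory, which is what makes "massive" URL deduplication feasible on a small crawler.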
Regarding the function expansion module, this paper describes three features that distinguish this crawler from existing ones. First, it analyzes the performance of crawled static web pages and offers optimization suggestions for the target websites. Second, it acts as an automated testing tool that runs test cases on crawled pages. Third, it extracts customized data from web resources according to user-specified formats.

To verify the strategies above, the .NET platform was chosen as a suitable base for building a lightweight web crawler. The crawler is written in C# and developed in the Visual Studio 2008 IDE. Experimental results show that the crawler runs in console mode with high configurability driven by file inputs.
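The customized, format-driven data extraction described above can be sketched roughly as below. Again, the thesis's actual implementation is in C# and reads its format specifications from input files; the `FORMAT_SPEC` mapping, the field names, and the patterns here are all hypothetical, chosen only to show the idea of extracting named fields by user-supplied rules.

```python
import re

# Hypothetical format specification: field name -> regex with one capture group.
# In the thesis such specs would come from configuration files; inlined here.
FORMAT_SPEC = {
    "title": r"<title>(.*?)</title>",
    "links": r'href="(https?://[^"]+)"',
}

def extract(html, spec=FORMAT_SPEC):
    """Pull customized fields out of a crawled page according to the spec."""
    return {
        field: re.findall(pattern, html, re.IGNORECASE | re.DOTALL)
        for field, pattern in spec.items()
    }

page = '<html><title>Demo</title><a href="http://example.com/a">link</a></html>'
fields = extract(page)
# fields == {'title': ['Demo'], 'links': ['http://example.com/a']}
```

Driving extraction from a declarative spec, rather than hard-coded parsing, is what lets one crawler serve many target formats without recompilation.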
Keywords/Search Tags: Web crawler, Performance improvement, Function expansion