
Research And Optimization Of Web Crawler System Under Distributed Environment

Posted on: 2016-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: L B Geng
Full Text: PDF
GTID: 2298330467493012
Subject: Computer technology
Abstract/Summary:
Today, information is growing at an exponential rate, and the crawling performance of a stand-alone web crawler has become a bottleneck. Web crawler systems built on distributed architectures resolve this bottleneck effectively. Such systems download web content using multi-threaded and asynchronous modules. A fully multi-threaded approach, however, suffers from synchronization and resource-contention problems; solving them requires a thread-management module, which itself degrades system performance. At the same time, when a web crawler handles big data, the URL filtering strategy runs into performance or storage problems. Optimizing the page-downloading module and the URL filtering strategy therefore has significant engineering value.

To address these problems, this thesis proposes optimization schemes for the page-downloading module and the URL filtering strategy. For the downloading module, we design a thread pool based on the half-sync/half-async pattern: the main thread is responsible for task scheduling, while worker threads carry out the concrete processing logic; network events are handled by the Libevent library. The URL filtering strategy adopts a caching mechanism: URLs with a high repeat degree are kept in a buffer queue, which reduces the access frequency to the storage system and improves efficiency. Based on these optimization schemes, we design a web crawler that runs in a Hadoop distributed environment.

Finally, we set up a test environment and design functional and performance test cases to evaluate the optimized web crawler system. By comparing its crawling throughput with an existing distributed web crawler, we show that the crawler we design is efficient.
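The half-sync/half-async pattern described above can be sketched as follows. This is a minimal illustration only: the thesis's actual system uses the Libevent C library for its asynchronous network layer, whereas here a plain task queue stands in for that layer, and all names (`HalfSyncHalfAsyncPool`, `submit`, the example URLs) are assumptions, not identifiers from the thesis.

```python
import queue
import threading

class HalfSyncHalfAsyncPool:
    """Main thread enqueues tasks (async layer); workers process them (sync layer)."""

    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()      # hand-off point between the two layers
        self.results = queue.Queue()
        self.workers = [threading.Thread(target=self._worker, daemon=True)
                        for _ in range(num_workers)]
        for w in self.workers:
            w.start()

    def _worker(self):
        # Synchronous layer: each worker blocks on the queue and performs
        # the concrete processing logic for one task at a time.
        while True:
            url = self.tasks.get()
            if url is None:             # sentinel: shut this worker down
                self.tasks.task_done()
                break
            self.results.put((url, "processed:" + url))
            self.tasks.task_done()

    def submit(self, url):
        # Scheduling layer: the main thread only enqueues work and never
        # blocks on the processing itself.
        self.tasks.put(url)

    def shutdown(self):
        # One sentinel per worker; real tasks are drained first (FIFO).
        for _ in self.workers:
            self.tasks.put(None)
        for w in self.workers:
            w.join()

pool = HalfSyncHalfAsyncPool(num_workers=2)
for u in ["http://a.example", "http://b.example"]:
    pool.submit(u)
pool.shutdown()
print(pool.results.qsize())  # 2
```

The design choice mirrors the one the abstract motivates: the scheduling thread stays responsive because blocking work is isolated in the worker pool, at the cost of the queue hand-off between the two layers.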
By comparing URL filtering time and accuracy with existing URL filtering strategies, we show that the filtering strategy we design is efficient as well.
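The cached URL-filter strategy from the abstract can be sketched as below: URLs with a high repeat degree stay in a small in-memory buffer, so most duplicate checks never touch the slower storage system. This is a simplified analogue under stated assumptions: an LRU dictionary plays the role of the buffer queue, a Python set stands in for the storage system, and the class and method names are hypothetical.

```python
from collections import OrderedDict

class CachedUrlFilter:
    """Duplicate-URL check with an LRU buffer in front of the storage system."""

    def __init__(self, cache_size=1024):
        self.cache = OrderedDict()   # buffer of recently seen, high-repeat URLs
        self.cache_size = cache_size
        self.store = set()           # stands in for the persistent storage system
        self.store_lookups = 0       # counts how often storage had to be queried

    def seen_before(self, url):
        if url in self.cache:
            self.cache.move_to_end(url)   # keep hot URLs resident
            return True
        # Cache miss: fall back to the storage system, then remember the URL.
        self.store_lookups += 1
        duplicate = url in self.store
        self.store.add(url)
        self.cache[url] = True
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least-recently-used URL
        return duplicate
```

For example, filtering the same URL twice costs only one storage lookup, which is exactly the access-frequency reduction the strategy aims for.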
Keywords/Search Tags: URL filter strategy, Libevent framework, Web crawler, Hadoop