
Research And Optimization Of Web Crawler System Under Distributed Environment

Posted on: 2016-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: L B Geng
Full Text: PDF
GTID: 2298330467493012
Subject: Computer technology
Abstract/Summary:
Today, information is growing at an exponential rate, and the crawling performance of a stand-alone web crawler has become a bottleneck. Web crawler systems built on distributed architectures resolve this bottleneck effectively. Such systems download web content using multi-threaded and asynchronous modules. A fully multi-threaded approach, however, suffers from synchronization and resource-contention problems; solving them requires a thread-management module, which itself degrades system performance. At the same time, when a web crawler handles big data, the URL filtering strategy runs into performance or storage problems. Optimizing the page-downloading module and the URL filtering strategy therefore has significant engineering value.

To address these problems, this thesis proposes optimization schemes for the page-downloading module and the URL filtering strategy. For the downloading module, we design a thread pool based on the half-sync/half-async pattern: the main thread is responsible for task scheduling, while worker threads carry out the concrete processing logic; network events are handled by the Libevent library. The URL filtering strategy adopts a caching mechanism: URLs with a high repeat degree are kept in a buffer queue, which reduces the access frequency to the storage system and improves efficiency. Based on these optimization schemes, we design a web crawler that runs in a Hadoop distributed environment.

Finally, we set up a test environment and design functional and performance test cases to evaluate the optimized web crawler system. By comparing its crawling throughput with an existing distributed web crawler, we show that the crawler we design is efficient.
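The half-sync/half-async pattern described above can be sketched as follows. This is a minimal illustration only: the thesis's actual system uses the Libevent C library for its asynchronous network layer, whereas here a plain task queue stands in for that layer, and all names (`HalfSyncHalfAsyncPool`, `submit`, the example URLs) are assumptions, not identifiers from the thesis.

```python
import queue
import threading

class HalfSyncHalfAsyncPool:
    """Main thread enqueues tasks (async layer); workers process them (sync layer)."""

    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()      # hand-off point between the two layers
        self.results = queue.Queue()
        self.workers = [threading.Thread(target=self._worker, daemon=True)
                        for _ in range(num_workers)]
        for w in self.workers:
            w.start()

    def _worker(self):
        # Synchronous layer: each worker blocks on the queue and performs
        # the concrete processing logic for one task at a time.
        while True:
            url = self.tasks.get()
            if url is None:             # sentinel: shut this worker down
                self.tasks.task_done()
                break
            self.results.put((url, "processed:" + url))
            self.tasks.task_done()

    def submit(self, url):
        # Scheduling layer: the main thread only enqueues work and never
        # blocks on the processing itself.
        self.tasks.put(url)

    def shutdown(self):
        # One sentinel per worker; real tasks are drained first (FIFO).
        for _ in self.workers:
            self.tasks.put(None)
        for w in self.workers:
            w.join()

pool = HalfSyncHalfAsyncPool(num_workers=2)
for u in ["http://a.example", "http://b.example"]:
    pool.submit(u)
pool.shutdown()
print(pool.results.qsize())  # 2
```

The design choice mirrors the one the abstract motivates: the scheduling thread stays responsive because blocking work is isolated in the worker pool, at the cost of the queue hand-off between the two layers.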
By comparing URL filtering time and accuracy with existing URL filtering strategies, we show that the filtering strategy we design is efficient as well.
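The cached URL-filter strategy from the abstract can be sketched as below: URLs with a high repeat degree stay in a small in-memory buffer, so most duplicate checks never touch the slower storage system. This is a simplified analogue under stated assumptions: an LRU dictionary plays the role of the buffer queue, a Python set stands in for the storage system, and the class and method names are hypothetical.

```python
from collections import OrderedDict

class CachedUrlFilter:
    """Duplicate-URL check with an LRU buffer in front of the storage system."""

    def __init__(self, cache_size=1024):
        self.cache = OrderedDict()   # buffer of recently seen, high-repeat URLs
        self.cache_size = cache_size
        self.store = set()           # stands in for the persistent storage system
        self.store_lookups = 0       # counts how often storage had to be queried

    def seen_before(self, url):
        if url in self.cache:
            self.cache.move_to_end(url)   # keep hot URLs resident
            return True
        # Cache miss: fall back to the storage system, then remember the URL.
        self.store_lookups += 1
        duplicate = url in self.store
        self.store.add(url)
        self.cache[url] = True
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least-recently-used URL
        return duplicate
```

For example, filtering the same URL twice costs only one storage lookup, which is exactly the access-frequency reduction the strategy aims for.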
Keywords/Search Tags: URL filter strategy, Libevent framework, Web crawler, Hadoop