
Research And Implementation Of Distributed Web Crawler Technology

Posted on: 2013-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Wang
Full Text: PDF
GTID: 2248330374985992
Subject: Information security
Abstract/Summary:
The explosive growth of the Internet has boosted the number of websites from several thousand in 1993 to nearly a billion today, and that number is still rising rapidly. With this fast development, related services and information content are growing quickly as well. As this information comes into wide use, the web crawler, which is responsible for collecting it, faces a major challenge. At present, some large firms and research institutions at home and abroad (such as Google and Baidu) have already provided mature solutions, some of which have been put into use. However, most of these solutions cannot offer users a custom-made search service. Furthermore, many companies regard their crawler technology as a trade secret and never make it public, which fails to meet the growing demands of a broad range of users. The Internet is too huge and complicated for even search giants such as Google and Baidu to collect all websites completely, let alone ordinary users.

This thesis makes an in-depth study of small and medium sized distributed crawlers. A distributed web crawler based on the MapReduce distributed computation model is designed and implemented. The main contents are as follows.

First, this thesis introduces web crawler related technologies and the prevailing computation models. Then, a distributed web crawler system named DWCS is designed on the basis of the MapReduce distributed computation model. The DWCS consists of multiple PCs. Generalized crawler modules are used to capture web pages; duplicate URLs are filtered out by the master module, and the remaining URLs are assigned to the crawler modules. Next, following this design, the thesis implements the distributed crawler using Python and Mincemeat.py. Finally, the DWCS is tested, and the test results are analyzed in detail. After summarizing the work completed so far, a plan for future work is presented.
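The abstract gives no code, but the division of labour it describes (crawler modules extract links from fetched pages, while the master module filters out duplicate URLs before assigning the rest) can be sketched as a minimal map/reduce-style round in plain Python. This is an illustration only: the page contents, function names, and URLs below are hypothetical, and the actual DWCS runs Mincemeat.py across multiple PCs rather than a single process.

```python
# Single-process sketch of one DWCS crawl round as described above.
# map_fetch plays the crawler module (extract out-links from a page);
# reduce_filter plays the master module (drop URLs already seen).
# The PAGES dict stands in for real HTTP fetches; all names are illustrative.
import re

PAGES = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>',
}

def map_fetch(url, html):
    """Crawler side: emit every out-link found in the page body."""
    return re.findall(r'href="([^"]+)"', html)

def reduce_filter(candidate_urls, seen):
    """Master side: keep only URLs not crawled or queued before."""
    fresh = []
    for u in candidate_urls:
        if u not in seen:
            seen.add(u)
            fresh.append(u)
    return fresh

seen = set(PAGES)                       # URLs already crawled
found = []
for url, html in PAGES.items():         # one "map" task per fetched page
    found.extend(map_fetch(url, html))
frontier = reduce_filter(found, seen)   # next round's crawl assignments
print(frontier)                         # only http://c.example/ is new
```

In the real system the map tasks would run on separate crawler PCs and the reduce step on the master node, but the data flow is the same: many lists of extracted links in, one deduplicated frontier out.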
Keywords/Search Tags: Web Crawler, MapReduce, Python, URL Filter