
Distributed Web Crawler System Design And Implementation

Posted on: 2014-12-04    Degree: Master    Type: Thesis
Country: China    Candidate: Y Lv    Full Text: PDF
GTID: 2268330425968830    Subject: Software engineering
Abstract/Summary:
The Internet industry has maintained rapid growth since 2000, and the quantity of information online has grown exponentially. As a result, people must spend a great deal of time searching for the information they need, and the desire to obtain that information whenever and wherever possible has become ever stronger. Cloud computing has seized the opportunity presented by this situation: companies around the world, including Google, IBM, Apache, and Amazon, have invested heavily in manpower, material, and financial resources. The Hadoop platform developed by Apache is, from the user's point of view, an open-source cloud computing framework, and the distributed web crawler presented in this thesis is designed and implemented on top of it.

This thesis designs and implements a distributed web crawler based on Hadoop, through which large-scale data collection can be achieved. At the same time, the crawler system can collect various kinds of information, including information from mainstream news sites in many languages around the world. Information in different languages is not stored together but kept separate, which provides convenience for future cross-language processing.

This thesis mainly studies the following parts: first, a detailed introduction to cloud computing; then, an introduction to the Hadoop distributed platform; and finally, a literature-based investigation of the current state of web crawler development and its principles.

The research above is the fundamental basis for this thesis. Building on it, we propose a design for a Hadoop-based distributed web crawler system. The design covers not only the setup process but also the details of the system and its basic framework, as well as the division of the system into functional modules and the Map/Reduce design of each module. The thesis ends with a summary and with directions for further research.

In a word, the main significance of this thesis is the design and implementation of a distributed crawler system based on Hadoop. The system not only addresses the inefficiency of a single-machine crawler but also improves the scalability of the system, and the speed of large-scale information gathering is improved accordingly. As a result, it can provide valid data for the indexing and information-processing modules of the "Distributed cross-language information access and retrieval platform".
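The abstract does not reproduce the thesis's module design, but a minimal sketch can illustrate how the fetch step of such a crawler might be expressed as a Hadoop Map/Reduce task. The class name FetchMapper and the one-URL-per-line input format below are assumptions for illustration, not details taken from the thesis.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical fetch stage: each input line is one URL from a crawl
 * frontier stored on HDFS; the mapper downloads the page and emits
 * (url, page content) pairs for a later parse/extract stage.
 */
public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String contentLine;
            while ((contentLine = reader.readLine()) != null) {
                page.append(contentLine).append('\n');
            }
        } catch (IOException e) {
            // Skip unreachable pages; a real crawler would record the failure
            // and retry according to its scheduling and politeness policy.
            return;
        }
        context.write(new Text(url), new Text(page.toString()));
    }
}

Running many such map tasks in parallel across the cluster is what gives a Hadoop-based crawler its scalability; a reduce stage (or a further job) would then handle deduplication, language-specific storage, and link extraction.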
Keywords/Search Tags: Distributed Crawler System, Map/Reduce, HDFS, Search Engine, Cloud Computing