
Distributed Web Crawler System Design And Implementation

Posted on: 2014-12-04    Degree: Master    Type: Thesis
Country: China    Candidate: Y Lv    Full Text: PDF
GTID: 2268330425968830    Subject: Software engineering
Abstract/Summary:
The Internet industry has maintained rapid growth since 2000, and the quantity of information online has grown exponentially. As a result, people must spend a great deal of time searching for the information they need, and the desire to obtain that information whenever and wherever possible has become ever stronger. Cloud computing has seized the opportunity presented by this situation: companies around the world, including Google, IBM, Apache, and Amazon, have invested heavily in manpower, material, and financial resources. The Hadoop platform developed by Apache is, from the user's point of view, an open-source cloud computing framework, and the distributed web crawler presented in this thesis is designed and implemented on top of it.

This thesis designs and implements a distributed web crawler based on Hadoop, through which large-scale data collection can be achieved. At the same time, the crawler system can collect various kinds of information, including information from mainstream news sites in many languages around the world. Information in different languages is not stored together but kept separate, which provides convenience for future cross-language processing.

This thesis mainly studies the following parts: first, a detailed introduction to cloud computing; then, an introduction to the Hadoop distributed platform; and finally, a literature-based investigation of the current state of web crawler development and its principles.

The research above is the fundamental basis for this thesis. Building on it, we propose a design for a Hadoop-based distributed web crawler system. The design covers not only the setup process but also the details of the system and its basic framework, as well as the division of the system into functional modules and the Map/Reduce design of each module. The thesis ends with a summary and with directions for further research.

In a word, the main significance of this thesis is the design and implementation of a distributed crawler system based on Hadoop. The system not only addresses the inefficiency of a single-machine crawler but also improves the scalability of the system, and the speed of large-scale information gathering is improved accordingly. As a result, it can provide valid data for the indexing and information-processing modules of the "Distributed cross-language information access and retrieval platform".
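The abstract does not reproduce the thesis's module design, but a minimal sketch can illustrate how the fetch step of such a crawler might be expressed as a Hadoop Map/Reduce task. The class name FetchMapper and the one-URL-per-line input format below are assumptions for illustration, not details taken from the thesis.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical fetch stage: each input line is one URL from a crawl
 * frontier stored on HDFS; the mapper downloads the page and emits
 * (url, page content) pairs for a later parse/extract stage.
 */
public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String contentLine;
            while ((contentLine = reader.readLine()) != null) {
                page.append(contentLine).append('\n');
            }
        } catch (IOException e) {
            // Skip unreachable pages; a real crawler would record the failure
            // and retry according to its scheduling and politeness policy.
            return;
        }
        context.write(new Text(url), new Text(page.toString()));
    }
}

Running many such map tasks in parallel across the cluster is what gives a Hadoop-based crawler its scalability; a reduce stage (or a further job) would then handle deduplication, language-specific storage, and link extraction.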
Keywords/Search Tags: Distributed Crawler System, Map/Reduce, HDFS, Search Engine, Cloud Computing