The Research And Implementation Of Distributed Web Crawler System Based On Hadoop

Posted on:2016-11-21

Degree:Master

Type:Thesis

Country:China

Candidate:R Sun

GTID:2298330467991873

Subject:Computer technology

Abstract/Summary:

The Internet is becoming more and more important in people's lives, and websites, forums, QQ, email and similar services have become the main ways to communicate and get information. With so many kinds of services, the Internet has evolved into a virtual society. How to manage this virtual society efficiently and securely is a major problem. It is therefore necessary to build a comprehensive Internet management system that draws on existing technologies and reflects the characteristics of the virtual society, and the first step is to obtain massive numbers of Internet website records.

The goal of this thesis is to design and implement a crawler system based on Hadoop that crawls massive data from the Internet and supplies website information to the Internet Website Resource and Data Management System. The system collects website information from the Province Gateway Website, then processes and stores the data for website record search.

The thesis analyzes in depth the basic working principles, framework and strategies of web crawlers, and describes the Hadoop distributed platform technologies, including the Hadoop Distributed File System (HDFS) and the MapReduce computational model. It analyzes the system's business and performance requirements and completes the overall system design, including the physical architecture, functional modules and workflow. The modules are then implemented in code and the system is tested according to that design. Finally, the thesis delivers a distributed web crawler system based on Hadoop that uses HDFS and the MapReduce model. To some extent, this system solves the low efficiency and poor scalability of a single-machine crawler, improves crawling speed and quality, and provides massive website record information for the resource management system.
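The abstract does not show how the crawl is expressed in the map-reduce model, so the following is only a minimal sketch of one plausible step, written in plain Python rather than against the Hadoop API. It assumes the map step keys each discovered URL by its host and the reduce step deduplicates the URLs of each host; the function names (`map_url`, `shuffle`, `reduce_host`, `run_round`) and the normalization rule (lowercase host, query string and port dropped) are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_url(url):
    """Map step (assumed): emit (host, normalized_url), or nothing if no host."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if not host:
        return []  # skip strings that do not parse as absolute URLs
    path = parts.path or "/"
    # Assumption: scheme + host + path is enough to identify a website record.
    normalized = f"{parts.scheme.lower()}://{host}{path}"
    return [(host, normalized)]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_host(host, urls):
    """Reduce step (assumed): deduplicate and order the URLs of one host."""
    return host, sorted(set(urls))


def run_round(raw_urls):
    """Run one in-memory map -> shuffle -> reduce round over raw URLs."""
    pairs = [pair for url in raw_urls for pair in map_url(url)]
    return dict(reduce_host(host, urls) for host, urls in shuffle(pairs).items())


if __name__ == "__main__":
    seeds = [
        "http://Example.com",
        "http://example.com/a",
        "https://other.org/x",
    ]
    for host, urls in run_round(seeds).items():
        print(host, urls)
```

Keying by host is one common choice because it lets each reducer handle a single site's records in isolation; on a real cluster the same two functions would be the mapper and reducer, with input and output stored in HDFS.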

Keywords/Search Tags:

distributed, crawler, hadoop, hdfs, mapreduce
