
Design And Implementation Of Distributed Web Crawler Based On Hadoop

Posted on: 2019-08-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2428330545459931
Subject: Computer technology
Abstract/Summary:
With the advancement of Big Data and Artificial Intelligence, information on the Internet keeps pace with the times and shows explosive growth. Traditional stand-alone web crawlers can no longer meet the demand for fast, efficient, and reliable access to the rapidly growing network information resources on the Internet. In recent years, the continuous development of Hadoop, Spark, and other distributed Big Data technologies has made it possible to store and process massive network information resources. Based on Hadoop, this paper therefore designs and implements a distributed web crawler system. It aims to solve the problems that the traditional stand-alone web crawler and existing distributed web crawlers are not suited to massive web information resources, crawl slowly, and suffer a single point of failure at the master node, so that the crawler becomes applicable to massive web information resources and crawls faster. The main contents of this paper are as follows:

Firstly, this paper studies the basic principles and workflow of the traditional stand-alone web crawler and of the related Hadoop components, with an in-depth focus on the duplicated URL detection and duplicated web page detection algorithms used in web crawlers. To address the shortcomings of the Bloom Filter duplicated URL detection algorithm and the SimHash duplicated web page detection algorithm, and drawing on Hadoop distributed programming technology, this paper designs and implements a distributed duplicated URL detection algorithm based on SimHash. The algorithm combines web page content with URL links for duplicate detection, achieves high detection efficiency and fast crawling, and is therefore applicable to duplicate detection over massive web data (a minimal sketch of this fingerprinting idea follows the abstract).

Secondly, this paper presents a detailed design and implementation of the distributed web crawler system based on Hadoop, including requirements analysis, architecture design, workflow design, functional structure design, and distributed storage design, and implements the functional modules of the system using HDFS and the MapReduce distributed programming model.

Thirdly, this paper builds a highly available distributed cluster test environment based on Hadoop high availability. According to the test plan given in this paper, the function, performance, scalability, high availability, and duplicated URL detection algorithm of the system are tested. Analysis of the test results shows that the distributed web crawler system designed in this paper meets the functional requirements of each module as well as the scalability, high availability, and duplicated URL detection requirements. At the same time, the system achieves high crawling efficiency and high duplicated URL detection efficiency, thus meeting the performance requirements of the system and remaining applicable to massive web information resources.
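The following is a minimal, single-machine Python sketch of the idea described above: a SimHash fingerprint built from both the URL link and the page content, with near-duplicates detected by Hamming distance. It is an illustrative assumption, not the thesis's Hadoop/MapReduce implementation; the function names (simhash, is_duplicate), the use of MD5 as the token hash, and the Hamming-distance threshold of 3 are chosen for the example only.

import hashlib
import re

def simhash(tokens, bits=64):
    # Weighted bit-voting over per-token hashes, as in the standard SimHash scheme.
    v = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def is_duplicate(url, page_text, seen_fingerprints, threshold=3):
    # Combine tokens from the URL link and the page content, then compare the
    # resulting fingerprint against those already seen (threshold is illustrative).
    tokens = re.findall(r"\w+", url.lower()) + re.findall(r"\w+", page_text.lower())
    fp = simhash(tokens)
    for old in seen_fingerprints:
        if hamming_distance(fp, old) <= threshold:
            return True, fp
    seen_fingerprints.add(fp)
    return False, fp

# Example usage: the second, identical page is flagged as a duplicate.
seen = set()
print(is_duplicate("http://example.com/a", "hadoop distributed crawler page", seen))
print(is_duplicate("http://example.com/a", "hadoop distributed crawler page", seen))

In the distributed setting described in the thesis, such per-page fingerprinting and comparison would be distributed across the cluster (for example, within MapReduce jobs over HDFS-stored pages) rather than kept in a single in-memory set as in this sketch.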
Keywords/Search Tags: distributed web crawler, Hadoop, duplicated URL detection, SimHash