
Design And Implementation Of Distributed Web Crawler Based On Hadoop

Posted on: 2019-08-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2428330545459931
Subject: Computer technology
Abstract/Summary:
With the advancement of Big Data and Artificial Intelligence, information on the Internet keeps pace with the times and shows explosive growth. Traditional stand-alone web crawlers can no longer meet the demand for fast, efficient, and reliable access to the rapidly growing network information resources on the Internet. In recent years, the continuous development of Hadoop, Spark, and other distributed Big Data technologies has made it possible to store and process massive network information resources. Based on Hadoop, this paper therefore designs and implements a distributed web crawler system. It aims to solve the problems that the traditional stand-alone web crawler and existing distributed web crawlers are not suited to massive web information resources, crawl slowly, and suffer a single point of failure at the master node, so that the crawler becomes applicable to massive web information resources and crawls faster. The main contents of this paper are as follows:

Firstly, this paper studies the basic principles and workflow of the traditional stand-alone web crawler and of the related Hadoop components, with an in-depth focus on the duplicated URL detection and duplicated web page detection algorithms used in web crawlers. To address the shortcomings of the Bloom Filter duplicated URL detection algorithm and the SimHash duplicated web page detection algorithm, and drawing on Hadoop distributed programming technology, this paper designs and implements a distributed duplicated URL detection algorithm based on SimHash. The algorithm combines web page content with URL links for duplicate detection, achieves high detection efficiency and fast crawling, and is therefore applicable to duplicate detection over massive web data (a minimal sketch of this fingerprinting idea follows the abstract).

Secondly, this paper presents a detailed design and implementation of the distributed web crawler system based on Hadoop, including requirements analysis, architecture design, workflow design, functional structure design, and distributed storage design, and implements the functional modules of the system using HDFS and the MapReduce distributed programming model.

Thirdly, this paper builds a highly available distributed cluster test environment based on Hadoop high availability. According to the test plan given in this paper, the function, performance, scalability, high availability, and duplicated URL detection algorithm of the system are tested. Analysis of the test results shows that the distributed web crawler system designed in this paper meets the functional requirements of each module as well as the scalability, high availability, and duplicated URL detection requirements. At the same time, the system achieves high crawling efficiency and high duplicated URL detection efficiency, thus meeting the performance requirements of the system and remaining applicable to massive web information resources.
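The following is a minimal, single-machine Python sketch of the idea described above: a SimHash fingerprint built from both the URL link and the page content, with near-duplicates detected by Hamming distance. It is an illustrative assumption, not the thesis's Hadoop/MapReduce implementation; the function names (simhash, is_duplicate), the use of MD5 as the token hash, and the Hamming-distance threshold of 3 are chosen for the example only.

import hashlib
import re

def simhash(tokens, bits=64):
    # Weighted bit-voting over per-token hashes, as in the standard SimHash scheme.
    v = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def is_duplicate(url, page_text, seen_fingerprints, threshold=3):
    # Combine tokens from the URL link and the page content, then compare the
    # resulting fingerprint against those already seen (threshold is illustrative).
    tokens = re.findall(r"\w+", url.lower()) + re.findall(r"\w+", page_text.lower())
    fp = simhash(tokens)
    for old in seen_fingerprints:
        if hamming_distance(fp, old) <= threshold:
            return True, fp
    seen_fingerprints.add(fp)
    return False, fp

# Example usage: the second, identical page is flagged as a duplicate.
seen = set()
print(is_duplicate("http://example.com/a", "hadoop distributed crawler page", seen))
print(is_duplicate("http://example.com/a", "hadoop distributed crawler page", seen))

In the distributed setting described in the thesis, such per-page fingerprinting and comparison would be distributed across the cluster (for example, within MapReduce jobs over HDFS-stored pages) rather than kept in a single in-memory set as in this sketch.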
Keywords/Search Tags: distributed web crawler, Hadoop, duplicated URL detection, SimHash