With the rapid development of the Internet and the massive volume of media works it carries, copyright protection has become an urgent problem. An effective copyright protection scheme for digital content tracks copies of that content by means of copy detection technology, and obtaining mass media resources is one of the main difficulties in copy detection. Today, the rapid development of cloud computing offers great advantages in mass data processing. In view of this, this paper designs and implements a video crawler on the Hadoop framework, which is used to collect the test video data set for a copy detection system.

This paper mainly studies the Hadoop framework, including the MapReduce computation model and the HDFS distributed file system, as well as the key technologies of distributed crawlers. It also discusses the advantages of the Hadoop framework for a distributed crawler system, such as its schemes for task scheduling and load balancing, and for keeping the whole crawler system stable when child nodes exit dynamically, which is a major problem in distributed crawlers. These problems are complex and error-prone, but the Hadoop framework solves them. Hence, a distributed video crawler system is designed based on Hadoop. The MapReduce computation model is used to implement crawling, page parsing, duplicate URL removal, downloading and other computing tasks; the URL set is first partitioned so that the load on each crawling node is balanced; and the HDFS distributed file system provides storage in coordination with the computation model.

Finally, functionality and performance tests are carried out with multiple crawling nodes configured for the video crawler. The test results demonstrate the feasibility and efficiency of the distributed crawler based on the Hadoop architecture, and prospects are put forward for addressing the remaining shortcomings of the crawler system.
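The URL-partitioning idea mentioned above can be sketched as follows. This is a minimal, hypothetical illustration (not the thesis code): each URL is assigned to a crawling node by hashing its host name, so all URLs from one site go to the same node while distinct hosts are spread across nodes, the same idea a Hadoop `Partitioner` applies to map output keys. The class and method names are assumptions for illustration only.

```java
import java.net.URI;

// Hypothetical sketch of host-hash URL partitioning for load balancing
// across crawling nodes (the same scheme a Hadoop Partitioner uses for keys).
public class UrlPartitioner {
    private final int numNodes; // number of crawling nodes in the cluster

    public UrlPartitioner(int numNodes) {
        this.numNodes = numNodes;
    }

    // Same host always maps to the same partition; the hash spreads
    // distinct hosts roughly evenly across the nodes.
    public int partition(String url) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numNodes;
    }

    public static void main(String[] args) {
        UrlPartitioner p = new UrlPartitioner(4);
        // URLs from the same host land on the same crawling node.
        System.out.println(p.partition("http://video.example.com/a.mp4"));
        System.out.println(p.partition("http://video.example.com/b.mp4"));
    }
}
```

Keeping one host on one node also makes it easy to apply per-site politeness limits, since only that node contacts the site.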