Design and Implementation of a Distributed Multimedia Web Crawler System

Posted on: 2013-08-16  Degree: Master  Type: Thesis
Country: China  Candidate: X Liu  Full Text: PDF
GTID: 2248330392957873  Subject: Computer applications
Abstract/Summary:
The rapid development of the Internet and of digital multimedia technology has not only greatly promoted the online marketing and dissemination of digital multimedia, but has also brought copyright infringement issues. Because digital works can be copied easily, anyone can redistribute or sell digital media on the Internet. Using a web crawler to actively search for multimedia works on the Internet, and tracking leaks via copy detection and digital fingerprinting technology, can effectively protect copyright.

Designing a distributed web crawler is challenging. This paper discusses the general framework design of a crawler and proposes a practical distributed architecture that effectively combines the advantages of centralized and distributed designs, providing a better solution to the task scheduling and repeated crawling problems caused by crawl nodes dynamically joining or leaving. Implementing a web crawler involves a number of key technologies. To improve crawling speed and obtain important resources as early as possible, this paper adopts a breadth-first search strategy based on URL filtering, which discards unimportant URLs. After an in-depth discussion and analysis of the Bloom filter-based method for removing duplicate URLs, we propose a distributed URL deduplication method based on the Bloom filter, in which each crawl node only has to maintain its own URL deduplication structure: the more nodes that crawl, the fewer URLs each node needs to deduplicate and the less memory each node requires, which speeds up deduplication. Using multiple threads on a crawl node can effectively speed up crawling, but the threads compete with one another; this paper discusses the problems encountered in the multi-threaded design.

This paper also discusses the problems and solutions in crawling multimedia resources, especially the video download problems found on video sharing sites, and uses the video site Youku as an example to describe the solution. A test on the running system was carried out to verify the usefulness of the distributed multimedia web crawling system.
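
The idea of per-node URL deduplication can be illustrated with a minimal Python sketch. The abstract does not say how URLs are assigned to nodes, so the hash partitioning below is an assumption, and `BloomFilter`, `owner_node`, and `CrawlNode` are hypothetical names chosen for illustration, not the thesis's implementation. The sketch only shows why each node needs to track just its own share of URLs.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over a fixed-size bit array (illustrative sizes)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by double hashing two parts of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def owner_node(url, num_nodes):
    """Map a URL to the node responsible for it (assumed hash partitioning)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes


class CrawlNode:
    """A crawl node that deduplicates only the URLs assigned to it."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.seen = BloomFilter()

    def accept(self, url):
        """Return True if the URL looks new to this node, and record it."""
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


nodes = [CrawlNode(i) for i in range(4)]
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
# Each URL is checked only against the filter of the node that owns it.
results = [nodes[owner_node(u, len(nodes))].accept(u) for u in urls]
```

Because the same URL always maps to the same node, the repeated URL is rejected by that node's own filter without consulting any other node. With more nodes, each filter holds a smaller share of the URL space, which matches the memory argument made in the abstract. Note that plain modulo partitioning reassigns most URLs when the node count changes, so handling nodes that join or leave would need additional design not shown here.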
Keywords/Search Tags: web crawler, distributed, multimedia