| The rapid development of Internet and multimedia technologies has promoted the online distribution of digital multimedia works, but it has also brought copyright infringement problems. Because the network makes copying so convenient, multimedia works on the Internet are easily duplicated: anyone can obtain a copy of a copyrighted work and spread it further over the network. Using a web crawler to search for and download multimedia works on the Internet, and then applying content-based similar audio retrieval to the downloaded material, can therefore effectively protect the copyright of multimedia works. This thesis designs and implements a distributed video crawler system that can search for and download multimedia works on the Internet. The main work is as follows: 1. It designs a web page analysis module and a video download module, both of which run multithreaded by maintaining a thread pool. 2. It proposes a crawler system that supports resumable downloading, so an interrupted video download continues from its breakpoint. 3. It presents a system framework for a distributed video crawler. Compared with ordinary web documents, video resources are much larger, take longer to download, and consume substantial system resources and network bandwidth, whereas web page analysis is fast and consumes few resources. Taking these characteristics into account, we design a system with two nodes: the central node is responsible for analyzing web sites, and the other node is responsible for downloading videos. We then implement the system on Hadoop. Building on the lab's similar audio search system, this thesis also proposes and implements a new index structure. The main work and innovations are as follows: 1. It analyzes the lab's similar audio search system, especially its index structure, and points out its shortcomings. 2. It presents an index structure algorithm based on Hamming embedding and implements that index structure within the audio processing framework of the original system; with the new algorithm, memory consumption decreases significantly. 3. It proposes a cascaded quantization index algorithm that, by using the new index structure, greatly reduces memory consumption while keeping detection accuracy close to that of the lab's existing method, so it can be applied to large databases. |
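
The abstract does not say how the thread pool or the breakpoint-resume download is implemented. As a minimal sketch only, assuming the video servers honour HTTP `Range` requests, the two ideas might be combined as below. The names `resume_offset`, `build_range_header`, `download`, and `download_all` are hypothetical and are not taken from the thesis, and this single-machine sketch does not represent the Hadoop-based system.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def resume_offset(path):
    """Bytes already on disk for a partial download (0 if the file is absent)."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def build_range_header(offset):
    """HTTP Range header asking for everything from `offset` onward."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def download(url, path, chunk_size=64 * 1024):
    """Download `url` to `path`, continuing from a partial file if one exists."""
    offset = resume_offset(path)
    request = Request(url, headers=build_range_header(offset))
    try:
        response = urlopen(request)
    except HTTPError as err:
        # 416 (range not satisfiable) typically means nothing is left to fetch.
        if err.code == 416:
            return path
        raise
    with response:
        # 206 means the server honoured the range, so append to the partial
        # file; any other success status returns the full body, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return path


def download_all(jobs, workers=4):
    """Run (url, path) download jobs on a fixed-size thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: download(*job), jobs))
```

Calling `download_all([(url, path), ...])` again after an interruption would continue each partial file from its current size rather than starting from zero. The pool size of 4 is an arbitrary placeholder; the abstract's point that video transfer is bandwidth-bound suggests that value would need tuning in practice.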