Research On Techniques Of Similarity-based Distributed Duplication Elimination

Posted on:2015-04-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y Yu

Full Text:PDF

GTID:2298330431486348

Subject:Computer software and theory

Abstract/Summary:

In the era of big data, the rapid growth of data volume poses great challenges for data storage and backup. Deduplication can effectively reduce the amount of data stored and lower data center costs. Because of its low memory usage, high throughput, and support for distribution, similarity-based deduplication has steadily gained importance and popularity in practice. It still has deficiencies, however: duplicate data remains between different similar sets; the stateless routing strategy based on representative fingerprints is likely to cause load imbalance between nodes; and the search for similar sets is not truly parallel.

To address these deficiencies, and building on the principle of similarity-based deduplication, this thesis designs a distributed deduplication architecture on the Hadoop platform. First, a global index and local indexes are set up so that similar sets are stored in a distributed manner and the search for similar sets can run in parallel. Second, a loop strategy gradually reduces the size of the similar data blocks. Finally, a multi-file parallel processing strategy further increases the degree of parallelism of the distributed architecture. The number of loop iterations is optimized through modeling, so users can choose the iteration count according to the execution time of the distributed architecture and the amount of duplicate data removed. Simulation results on real backup data sets with different characteristics show that, compared with the traditional techniques DDFS (locality-based) and Extreme Binning (similarity-based), our model uses less memory, achieves higher throughput, and adapts to the demands of large-scale data processing.
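The baseline that the abstract criticizes can be illustrated with a minimal sketch. It is not the thesis's architecture: it assumes fixed-size chunking, SHA-1 chunk fingerprints, the minimum chunk fingerprint as the representative fingerprint (the choice Extreme Binning makes), and routing by a simple modulo over a hypothetical node count. All function and class names are invented for illustration.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def chunk_fingerprints(data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and return each chunk's SHA-1 hex digest."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def representative_fingerprint(fingerprints):
    """Assumption: the minimum chunk fingerprint represents the whole file."""
    return min(fingerprints)


def route(representative, num_nodes):
    """Stateless routing: the node depends only on the representative fingerprint."""
    return int(representative, 16) % num_nodes


class SimilarityDeduplicator:
    """Toy model: each node keeps one set of chunk fingerprints per similar set."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # bins[node][representative] -> chunk fingerprints already stored there
        self.bins = [defaultdict(set) for _ in range(num_nodes)]

    def store(self, data):
        """Deduplicate a file against its own similar set only.

        Returns the number of chunks that were new to that set.
        """
        fingerprints = chunk_fingerprints(data)
        representative = representative_fingerprint(fingerprints)
        node = route(representative, self.num_nodes)
        known = self.bins[node][representative]
        new_chunks = set(fingerprints) - known
        known |= new_chunks
        return len(new_chunks)


dedup = SimilarityDeduplicator(num_nodes=4)
first = dedup.store(b"aaaabbbbcccc")   # three distinct chunks, all new
second = dedup.store(b"aaaabbbbcccc")  # same file again, nothing new
print(first, second)  # prints "3 0"
```

Because a file is compared only with the similar set selected by its representative fingerprint, a chunk that also exists in another set is stored again, and because routing ignores node state, nothing prevents many files from landing on the same node. These are the first two deficiencies listed in the abstract.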

Keywords/Search Tags:

deduplication, similarity, index optimization, distributed system, MapReduce

Related items

1	Design And Research On A High-performance Deduplication System
2	Research On Deduplication Algorithm Based On Similarity And Chunking
3	Research On Key Technologies Of Resources Management In Cloud Storage System
4	Research On Data Deduplication Technology Based On Hadoop
5	Research On Data Security Deduplication In Cloud Storage
6	The Design And Implementation Of Data Deduplication Index Server
7	Research On Keyword Search On Graphs Based On MapReduce
8	Research On Routing Algorithm For Distributed Data Deduplication Systems
9	OBF-Index:A Distributed Multi-Dimensional Index Based On Ordinal Bloom Filter
10	Research On Performance Optimization Of Virtual Machine Image Deduplication For Cloud Data Center