Research On Techniques Of Similarity-based Distributed Duplication Elimination

Posted on: 2015-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yu
GTID: 2298330431486348
Subject: Computer software and theory
Abstract/Summary:
In the era of big data, the growing volume of data poses great challenges for data storage and backup. Deduplication technology can effectively reduce the amount of data stored and lower the cost of the data center. Because of its low memory usage, high throughput, and support for distribution, similarity-based deduplication has steadily gained importance and popularity in practice. It still has some deficiencies, however: duplicate data remains between different similar sets; the stateless routing strategy based on representative fingerprints is likely to cause load imbalance between nodes; and the search for similar sets is not truly parallel.

In this paper, starting from these disadvantages and from the principle of similarity-based deduplication, we design a distributed deduplication architecture built on the Hadoop distributed platform. First, a global index and local indexes are set up so that similar sets are stored in a distributed manner, which allows the search for similar sets to run in parallel. Second, a loop strategy gradually reduces the size of the similar data blocks. Lastly, a multi-file parallel processing strategy further improves the degree of parallelism of the distributed architecture. We also model the number of loop iterations, so that users can choose it according to the execution time of the distributed architecture and the amount of duplicate data removed. Simulation results on real backup data sets with different characteristics show that, compared with the traditional locality-based technique DDFS and the similarity-based technique Extreme Binning, our model uses less memory, achieves higher throughput, and can meet the demands of processing large amounts of data.
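The idea of routing by representative fingerprint through a global index, with deduplication against a per-node local index, can be illustrated with a minimal single-process sketch. It is not the thesis's Hadoop implementation. The class names `Node` and `Cluster`, the fixed-size chunking, the choice of the minimum chunk fingerprint as the representative, and the least-loaded placement rule are all assumptions made for illustration.

```python
import hashlib


def chunk_fingerprints(data, chunk_size=4):
    """Split data into fixed-size chunks and return the SHA-1 hex digest of each.

    Fixed-size chunking is an assumption; real systems often use
    content-defined chunking instead.
    """
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def representative_fingerprint(fingerprints):
    """Pick the minimum fingerprint as the representative (an assumption)."""
    return min(fingerprints)


class Node:
    """Hypothetical storage node holding a local index of similar sets."""

    def __init__(self):
        # representative fingerprint -> set of chunk fingerprints in that similar set
        self.local_index = {}

    def load(self):
        """Number of chunk fingerprints stored on this node."""
        return sum(len(chunks) for chunks in self.local_index.values())

    def dedup(self, rep, fingerprints):
        """Store unseen fingerprints in the similar set for `rep`; return the new ones."""
        similar_set = self.local_index.setdefault(rep, set())
        # dict.fromkeys removes duplicates within the file while keeping order
        new = [fp for fp in dict.fromkeys(fingerprints) if fp not in similar_set]
        similar_set.update(new)
        return new


class Cluster:
    """Hypothetical cluster with a global index: representative fingerprint -> node id."""

    def __init__(self, num_nodes):
        self.nodes = [Node() for _ in range(num_nodes)]
        self.global_index = {}

    def store(self, data, chunk_size=4):
        fingerprints = chunk_fingerprints(data, chunk_size)
        rep = representative_fingerprint(fingerprints)
        node_id = self.global_index.get(rep)
        if node_id is None:
            # Assumed placement rule: send a new similar set to the least-loaded node.
            node_id = min(range(len(self.nodes)), key=lambda i: self.nodes[i].load())
            self.global_index[rep] = node_id
        return node_id, self.nodes[node_id].dedup(rep, fingerprints)
```

As a usage example, storing the same file twice with `Cluster(2).store(...)` routes both copies to the same node, and the second call returns an empty list of new chunks. In this sketch, recording the placement in the global index is what lets a new similar set go to a lightly loaded node, in contrast to the stateless routing that the abstract identifies as a source of load imbalance.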
Keywords/Search Tags: deduplication, similarity, index optimization, distributed system, MapReduce