
Research On Data Deduplication Technology For Spark Platform Based On RDE Chunking Algorithm

Posted on: 2023-10-09  Degree: Master  Type: Thesis
Country: China  Candidate: Y Chao  Full Text: PDF
GTID: 2568306830452594  Subject: Computer technology
Abstract/Summary:
With the development of big data technology, the amount of data generated in production and daily life is growing explosively, and the cost of data storage keeps rising for individuals and enterprises alike. How to provide effective and reliable data storage and backup services has therefore become a research hotspot. Data deduplication is a technique that eliminates redundant data while providing reliable storage and is widely used in large-scale storage systems, but several problems remain:

(1) Current content-based chunking algorithms suffer from excessive chunk size variance and low throughput.
(2) MapReduce-based deduplication systems require a large number of disk write operations and have low iterative computation efficiency.
(3) Existing HDFS-based distributed deduplication systems do not account for the mismatch between the size of HDFS blocks and content-based chunks, producing an excessive number of small HDFS files.
(4) Distributed deduplication systems that use local fingerprint repositories have difficulty detecting duplicate data across nodes, which lowers the overall deduplication rate.

To address these problems, this thesis proposes the Rapid Double Extremum (RDE) chunking algorithm and the S-Dedupe deduplication system. First, to reduce the excessive chunk size variance of content-based chunking, RDE adopts a double-extremum judgment strategy that improves the algorithm's ability to eliminate low-entropy strings. Second, to address the low throughput of content-based chunking, RDE uses a multi-byte sliding window to control the window sliding speed and the granularity of the extremum computation. Third, to curb the proliferation of small HDFS files in HDFS-based deduplication systems, S-Dedupe reduces small-file generation through a block aggregation strategy. Finally, to address the low deduplication rate caused by local fingerprint repositories, S-Dedupe builds a distributed fingerprint repository on HBase, mitigating the impact of per-node fingerprint repositories on the deduplication rate in a distributed environment.

Experimental results show that RDE significantly reduces chunk size variance and performs well in chunking throughput. S-Dedupe achieves a 2.26x throughput improvement over conventional deduplication systems and also excels in deduplication rate and in controlling the number of small HDFS files.
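To illustrate the general idea behind extremum-based content-defined chunking and fingerprint-based duplicate detection described above, the following Python sketch shows a single-extremum chunker and an in-memory fingerprint check. The window size, chunk size limits, and single-extremum cut rule are illustrative assumptions: the sketch does not reproduce RDE's double-extremum test or multi-byte sliding window, and the plain set stands in for S-Dedupe's distributed HBase fingerprint repository.

    import hashlib

    def extremum_chunks(data: bytes, window: int = 512, max_size: int = 64 * 1024):
        # Cut a chunk at the first position that lies `window` bytes past a local
        # maximum byte value; fall back to `max_size` if no such position exists.
        # (Single-extremum illustration only; RDE's double-extremum test and
        # multi-byte window stride are not reproduced here.)
        chunks, start = [], 0
        while start < len(data):
            max_val, max_pos = data[start], start
            cut = None
            for j in range(start + 1, min(len(data), start + max_size)):
                if data[j] > max_val:
                    max_val, max_pos = data[j], j      # new local maximum
                elif j - max_pos == window:
                    cut = j                            # maximum held for `window` bytes
                    break
            if cut is None:
                cut = min(len(data), start + max_size)
            chunks.append(data[start:cut])
            start = cut
        return chunks

    def deduplicate(chunks, fingerprint_store):
        # Keep only chunks whose SHA-256 fingerprints are not yet in the store.
        # `fingerprint_store` (an in-memory set here) stands in for a distributed
        # fingerprint repository such as the HBase-based one used by S-Dedupe.
        unique = []
        for chunk in chunks:
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in fingerprint_store:
                fingerprint_store.add(fp)
                unique.append(chunk)
        return unique

For example, chunking two successive backups of a largely unchanged file yields mostly identical chunks, so only the chunks covering the modified regions survive the deduplicate step.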
Keywords/Search Tags:big data, data deduplication, content-based chunking, backup systems, distributed file systems