
Research On Data Deduplication Technology For Spark Platform Based On RDE Chunking Algorithm

Posted on: 2023-10-09  Degree: Master  Type: Thesis
Country: China  Candidate: Y Chao  Full Text: PDF
GTID: 2568306830452594  Subject: Computer technology
Abstract/Summary:
With the development of big data technology, the amount of data generated in production and daily life is growing explosively, and the cost of data storage keeps rising for individuals and enterprises alike. How to provide effective and reliable data storage and backup services has therefore become a research hotspot. Data deduplication is a technique that eliminates redundant data while providing reliable storage and is widely used in large-scale storage systems, but several problems remain:

(1) Current content-based chunking algorithms suffer from excessive chunk size variance and low throughput.
(2) MapReduce-based deduplication systems require a large number of disk write operations and have low iterative computation efficiency.
(3) Existing HDFS-based distributed deduplication systems do not account for the mismatch between the size of HDFS blocks and content-based chunks, producing an excessive number of small HDFS files.
(4) Distributed deduplication systems that use local fingerprint repositories have difficulty detecting duplicate data across nodes, which lowers the overall deduplication rate.

To address these problems, this thesis proposes the Rapid Double Extremum (RDE) chunking algorithm and the S-Dedupe deduplication system. First, to reduce the excessive chunk size variance of content-based chunking, RDE adopts a double-extremum judgment strategy that improves the algorithm's ability to eliminate low-entropy strings. Second, to address the low throughput of content-based chunking, RDE uses a multi-byte sliding window to control the window sliding speed and the granularity of the extremum computation. Third, to curb the proliferation of small HDFS files in HDFS-based deduplication systems, S-Dedupe reduces small-file generation through a block aggregation strategy. Finally, to address the low deduplication rate caused by local fingerprint repositories, S-Dedupe builds a distributed fingerprint repository on HBase, mitigating the impact of per-node fingerprint repositories on the deduplication rate in a distributed environment.

Experimental results show that RDE significantly reduces chunk size variance and performs well in chunking throughput. S-Dedupe achieves a 2.26x throughput improvement over conventional deduplication systems and also excels in deduplication rate and in controlling the number of small HDFS files.
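To illustrate the general idea behind extremum-based content-defined chunking and fingerprint-based duplicate detection described above, the following Python sketch shows a single-extremum chunker and an in-memory fingerprint check. The window size, chunk size limits, and single-extremum cut rule are illustrative assumptions: the sketch does not reproduce RDE's double-extremum test or multi-byte sliding window, and the plain set stands in for S-Dedupe's distributed HBase fingerprint repository.

    import hashlib

    def extremum_chunks(data: bytes, window: int = 512, max_size: int = 64 * 1024):
        # Cut a chunk at the first position that lies `window` bytes past a local
        # maximum byte value; fall back to `max_size` if no such position exists.
        # (Single-extremum illustration only; RDE's double-extremum test and
        # multi-byte window stride are not reproduced here.)
        chunks, start = [], 0
        while start < len(data):
            max_val, max_pos = data[start], start
            cut = None
            for j in range(start + 1, min(len(data), start + max_size)):
                if data[j] > max_val:
                    max_val, max_pos = data[j], j      # new local maximum
                elif j - max_pos == window:
                    cut = j                            # maximum held for `window` bytes
                    break
            if cut is None:
                cut = min(len(data), start + max_size)
            chunks.append(data[start:cut])
            start = cut
        return chunks

    def deduplicate(chunks, fingerprint_store):
        # Keep only chunks whose SHA-256 fingerprints are not yet in the store.
        # `fingerprint_store` (an in-memory set here) stands in for a distributed
        # fingerprint repository such as the HBase-based one used by S-Dedupe.
        unique = []
        for chunk in chunks:
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in fingerprint_store:
                fingerprint_store.add(fp)
                unique.append(chunk)
        return unique

For example, chunking two successive backups of a largely unchanged file yields mostly identical chunks, so only the chunks covering the modified regions survive the deduplicate step.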
Keywords/Search Tags:big data, data deduplication, content-based chunking, backup systems, distributed file systems