
Research on Deduplication Algorithm Based on Similarity and Chunking

Posted on: 2019-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Ge
GTID: 2428330545954775
Subject: Computer software and theory

Abstract/Summary:
Science and technology are permeating every industry at a rapid pace, driving explosive growth in the volume of data. To reduce storage costs and network overhead, data deduplication technology is increasingly applied in cloud storage, backup, and archive systems. At chunk granularity, however, traditional chunking algorithms perform poorly: fixed-size partitioning (FSP) cannot adapt to data variability, content-defined chunking (CDC) requires extensive manual tuning, and the deduplication performance of classic parameter-based chunking algorithms is unstable. In addition, as data volumes grow, the ever-larger fingerprint index can no longer be searched quickly; mechanical I/O operations significantly increase the running time of the algorithm, making its performance difficult to guarantee.

For big data workloads, CDC-based deduplication suffers from chunk sizes that are difficult to control, high fingerprint computation and comparison overhead, and parameters that must be set in advance. To address these problems, a deduplication algorithm based on winnowing fingerprint matching (CAWM) is proposed. First, a chunk size prediction model is introduced before chunking, which estimates an appropriate chunk size for the application scenario. Second, the ASCII/Unicode encoding of the data is used as the chunk fingerprint during fingerprint computation. Finally, when determining chunk boundaries, the proposed fingerprint-matching approach requires no parameters to be set in advance, which reduces the cost of fingerprint computation and comparison. Experimental results on a variety of datasets show that CAWM achieves a deduplication ratio about 10% higher than the FSP and CDC algorithms while reducing fingerprint computation and comparison overhead by about 18%. The chunk sizes and boundaries produced by CAWM therefore fit the characteristics of the data more closely, reduce the impact of parameter settings on deduplication performance, and eliminate more duplicate data across a wide variety of data types.

To address the I/O bottleneck of fingerprint comparison in deduplication, a secondary-index deduplication algorithm based on similarity clustering is proposed. First, the Simhash value of every data block is computed, and an adaptive similarity clustering algorithm groups blocks by the Hamming distance between their Simhash values; the information of all cluster centers forms the primary index, which is kept in memory. Then the MD5 fingerprint of each block in a cluster is computed to form the secondary index, which is stored with the cluster center. To check a block, the Hamming distance between its Simhash and the Simhash of every cluster center is computed, the cluster with the minimum distance is loaded into memory, and MD5 fingerprints are compared within that cluster. Experimental results show that the algorithm improves the deduplication ratio by 23%, introduces no false positives, and considerably accelerates fingerprint comparison, since only one I/O operation is issued per detection.
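The abstract does not give the concrete parameters of the winnowing-based chunker, so the following is only a minimal Python sketch of winnowing-style content-defined chunking: a boundary is cut where the newest k-gram fingerprint is the minimum of the last W fingerprints. The k-gram size K, window size W, the min/max chunk bounds, and the byte-code fingerprint are all assumed values for illustration, not the thesis's actual settings.

# Minimal sketch of winnowing-style content-defined chunking (illustrative only).
# K, W, MIN_CHUNK, MAX_CHUNK and the byte-code fingerprint are assumptions.
K = 8                   # bytes per k-gram (assumed)
W = 16                  # winnowing window over consecutive k-gram fingerprints (assumed)
MIN_CHUNK = 2 * 1024    # assumed lower bound on chunk size
MAX_CHUNK = 64 * 1024   # assumed upper bound on chunk size


def kgram_fingerprint(data: bytes, pos: int) -> int:
    # Fingerprint of the K bytes ending at pos, built from their ASCII/Unicode byte codes.
    return sum(data[pos - K + 1 + i] << (8 * (i % 4)) for i in range(K)) & 0xFFFFFFFF


def winnowing_chunks(data: bytes):
    # Yield chunks; cut where the newest fingerprint is the minimum of the last W
    # fingerprints (the winnowing selection rule), subject to size bounds.
    start = 0
    window = []                              # the last W k-gram fingerprints
    for pos in range(K - 1, len(data)):
        window.append(kgram_fingerprint(data, pos))
        if len(window) > W:
            window.pop(0)
        size = pos - start + 1
        if size < MIN_CHUNK:
            continue
        if (len(window) == W and window[-1] == min(window)) or size >= MAX_CHUNK:
            yield data[start:pos + 1]
            start = pos + 1
            window.clear()
    if start < len(data):
        yield data[start:]                   # trailing remainder becomes the last chunk

Because boundaries depend only on the local fingerprint minimum, identical regions of two files are cut at the same offsets even after insertions or deletions elsewhere, which is what allows duplicate chunks to be detected.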
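The two-level lookup of the similarity-clustering index can likewise be sketched in Python. The thesis does not specify the Simhash feature extraction or the clustering radius, so the byte-shingle features, the 64-bit Simhash, and the radius parameter below are assumptions; the per-cluster MD5 set stands in for the on-disk secondary index that the thesis loads with a single I/O.

import hashlib

SIMHASH_BITS = 64


def simhash(block: bytes, feature_size: int = 4) -> int:
    # 64-bit Simhash over fixed-size byte shingles (feature choice is an assumption).
    weights = [0] * SIMHASH_BITS
    for i in range(0, max(len(block) - feature_size + 1, 1)):
        h = int.from_bytes(hashlib.md5(block[i:i + feature_size]).digest()[:8], "big")
        for b in range(SIMHASH_BITS):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(SIMHASH_BITS) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    # Hamming distance between two Simhash values.
    return bin(a ^ b).count("1")


class TwoLevelIndex:
    # Primary index: cluster-center Simhash values kept in memory.
    # Secondary index: MD5 fingerprints of the blocks in each cluster
    # (stored with the cluster center; modelled here as an in-memory dict).
    def __init__(self, radius: int = 12):    # clustering radius is an assumed parameter
        self.radius = radius
        self.clusters = {}                    # center Simhash -> set of MD5 fingerprints

    def is_duplicate(self, block: bytes) -> bool:
        sh, md5 = simhash(block), hashlib.md5(block).hexdigest()
        if not self.clusters:
            self.clusters[sh] = {md5}
            return False
        # Primary lookup: nearest cluster center by Hamming distance (in memory).
        center = min(self.clusters, key=lambda c: hamming(c, sh))
        if hamming(center, sh) > self.radius:
            self.clusters[sh] = {md5}         # open a new cluster for dissimilar blocks
            return False
        # Secondary lookup: exact MD5 comparison inside the selected cluster
        # (in the thesis this is the single disk I/O per detection).
        if md5 in self.clusters[center]:
            return True
        self.clusters[center].add(md5)
        return False

Because the exact MD5 comparison is always performed before a block is declared a duplicate, the Simhash stage only narrows the search and cannot by itself cause a false positive, which matches the reported zero false-positive rate.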
Keywords/Search Tags: deduplication, chunking based on winnowing, similarity clustering, secondary index, Simhash