
Research on Deduplication Algorithm Based on Similarity and Chunking

Posted on: 2019-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Ge
GTID: 2428330545954775
Subject: Computer software and theory

Abstract/Summary:
Science and technology are permeating every industry at a rapid pace, driving explosive growth in the volume of data. To reduce storage costs and network overhead, data deduplication technology is increasingly applied in cloud storage, backup, and archive systems. At chunk granularity, however, traditional chunking algorithms perform poorly: fixed-size partitioning (FSP) cannot adapt to data variability, content-defined chunking (CDC) requires extensive manual tuning, and the deduplication performance of classic parameter-based chunking algorithms is unstable. In addition, as data volumes grow, the ever-larger fingerprint index can no longer be searched quickly; mechanical I/O operations significantly increase the running time of the algorithm, making its performance difficult to guarantee.

For big data workloads, CDC-based deduplication suffers from chunk sizes that are difficult to control, high fingerprint computation and comparison overhead, and parameters that must be set in advance. To address these problems, a deduplication algorithm based on winnowing fingerprint matching (CAWM) is proposed. First, a chunk size prediction model is introduced before chunking, which estimates an appropriate chunk size for the application scenario. Second, the ASCII/Unicode encoding of the data is used as the chunk fingerprint during fingerprint computation. Finally, when determining chunk boundaries, the proposed fingerprint-matching approach requires no parameters to be set in advance, which reduces the cost of fingerprint computation and comparison. Experimental results on a variety of datasets show that CAWM achieves a deduplication ratio about 10% higher than the FSP and CDC algorithms while reducing fingerprint computation and comparison overhead by about 18%. The chunk sizes and boundaries produced by CAWM therefore fit the characteristics of the data more closely, reduce the impact of parameter settings on deduplication performance, and eliminate more duplicate data across a wide variety of data types.

To address the I/O bottleneck of fingerprint comparison in deduplication, a secondary-index deduplication algorithm based on similarity clustering is proposed. First, the Simhash value of every data block is computed, and an adaptive similarity clustering algorithm groups blocks by the Hamming distance between their Simhash values; the information of all cluster centers forms the primary index, which is kept in memory. Then the MD5 fingerprint of each block in a cluster is computed to form the secondary index, which is stored with the cluster center. To check a block, the Hamming distance between its Simhash and the Simhash of every cluster center is computed, the cluster with the minimum distance is loaded into memory, and MD5 fingerprints are compared within that cluster. Experimental results show that the algorithm improves the deduplication ratio by 23%, introduces no false positives, and considerably accelerates fingerprint comparison, since only one I/O operation is issued per detection.
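The abstract does not give the concrete parameters of the winnowing-based chunker, so the following is only a minimal Python sketch of winnowing-style content-defined chunking: a boundary is cut where the newest k-gram fingerprint is the minimum of the last W fingerprints. The k-gram size K, window size W, the min/max chunk bounds, and the byte-code fingerprint are all assumed values for illustration, not the thesis's actual settings.

# Minimal sketch of winnowing-style content-defined chunking (illustrative only).
# K, W, MIN_CHUNK, MAX_CHUNK and the byte-code fingerprint are assumptions.
K = 8                   # bytes per k-gram (assumed)
W = 16                  # winnowing window over consecutive k-gram fingerprints (assumed)
MIN_CHUNK = 2 * 1024    # assumed lower bound on chunk size
MAX_CHUNK = 64 * 1024   # assumed upper bound on chunk size


def kgram_fingerprint(data: bytes, pos: int) -> int:
    # Fingerprint of the K bytes ending at pos, built from their ASCII/Unicode byte codes.
    return sum(data[pos - K + 1 + i] << (8 * (i % 4)) for i in range(K)) & 0xFFFFFFFF


def winnowing_chunks(data: bytes):
    # Yield chunks; cut where the newest fingerprint is the minimum of the last W
    # fingerprints (the winnowing selection rule), subject to size bounds.
    start = 0
    window = []                              # the last W k-gram fingerprints
    for pos in range(K - 1, len(data)):
        window.append(kgram_fingerprint(data, pos))
        if len(window) > W:
            window.pop(0)
        size = pos - start + 1
        if size < MIN_CHUNK:
            continue
        if (len(window) == W and window[-1] == min(window)) or size >= MAX_CHUNK:
            yield data[start:pos + 1]
            start = pos + 1
            window.clear()
    if start < len(data):
        yield data[start:]                   # trailing remainder becomes the last chunk

Because boundaries depend only on the local fingerprint minimum, identical regions of two files are cut at the same offsets even after insertions or deletions elsewhere, which is what allows duplicate chunks to be detected.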
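The two-level lookup of the similarity-clustering index can likewise be sketched in Python. The thesis does not specify the Simhash feature extraction or the clustering radius, so the byte-shingle features, the 64-bit Simhash, and the radius parameter below are assumptions; the per-cluster MD5 set stands in for the on-disk secondary index that the thesis loads with a single I/O.

import hashlib

SIMHASH_BITS = 64


def simhash(block: bytes, feature_size: int = 4) -> int:
    # 64-bit Simhash over fixed-size byte shingles (feature choice is an assumption).
    weights = [0] * SIMHASH_BITS
    for i in range(0, max(len(block) - feature_size + 1, 1)):
        h = int.from_bytes(hashlib.md5(block[i:i + feature_size]).digest()[:8], "big")
        for b in range(SIMHASH_BITS):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(SIMHASH_BITS) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    # Hamming distance between two Simhash values.
    return bin(a ^ b).count("1")


class TwoLevelIndex:
    # Primary index: cluster-center Simhash values kept in memory.
    # Secondary index: MD5 fingerprints of the blocks in each cluster
    # (stored with the cluster center; modelled here as an in-memory dict).
    def __init__(self, radius: int = 12):    # clustering radius is an assumed parameter
        self.radius = radius
        self.clusters = {}                    # center Simhash -> set of MD5 fingerprints

    def is_duplicate(self, block: bytes) -> bool:
        sh, md5 = simhash(block), hashlib.md5(block).hexdigest()
        if not self.clusters:
            self.clusters[sh] = {md5}
            return False
        # Primary lookup: nearest cluster center by Hamming distance (in memory).
        center = min(self.clusters, key=lambda c: hamming(c, sh))
        if hamming(center, sh) > self.radius:
            self.clusters[sh] = {md5}         # open a new cluster for dissimilar blocks
            return False
        # Secondary lookup: exact MD5 comparison inside the selected cluster
        # (in the thesis this is the single disk I/O per detection).
        if md5 in self.clusters[center]:
            return True
        self.clusters[center].add(md5)
        return False

Because the exact MD5 comparison is always performed before a block is declared a duplicate, the Simhash stage only narrows the search and cannot by itself cause a false positive, which matches the reported zero false-positive rate.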
Keywords/Search Tags: deduplication, chunking based on winnowing, similarity clustering, secondary index, Simhash