
Research and Realization of a Deduplication Scheme Based on the Similarity of Adjacent Duplicate Data Chunks for Improved System Performance

Posted on: 2017-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: J H Tan
Full Text: PDF
GTID: 2428330536462638
Subject: Computer technology
Abstract/Summary:
With the globalization of information, the global volume of data is exploding, yet the growth of physical storage resources lags far behind the growth of data, and traditional storage methods need to be improved. Statistics show that storage systems contain large amounts of redundant data; these identical or similar copies drive the explosive growth in data volume. Deduplication identifies and deletes duplicate data in a storage system and is therefore widely used in related fields. As an emerging technology it still has much room for improvement, such as the extra overhead incurred when a deduplication index lookup misses, inefficient similarity-detection schemes, and the resource waste caused by storing chunks with zero citations. To address these shortcomings and thereby improve the performance of a file backup system, this thesis proposes a deduplication scheme based on the similarity of adjacent chunks.

The data stream is first chunked to enable effective detection. Different chunking algorithms produce very different chunks, which directly affects the deduplication result; after comparing several chunking algorithms, the sliding-block method is selected. Before a chunk can be looked up, its fingerprint must be computed; the fingerprint, which represents its chunk, is the basic unit of detection. Different hash algorithms differ in fingerprint accuracy and in the probability of hash collisions. The 160-bit secure hash algorithm SHA-1 meets the system's requirements, with a hash-collision probability between 2^-55 and 2^-75. Because the fingerprint index is too large to be held entirely in memory, lookups would otherwise require disk accesses and incur overhead; a Bloom filter is therefore introduced, which can quickly determine whether an element belongs to a set.

After exact-duplicate detection, the system still stores a large amount of similar data, and running similarity detection on every chunk is impractical. To make similarity detection more efficient, this thesis establishes a chunk-value evaluation model and, based on it, proposes a similarity-detection threshold. The threshold is derived from a chunk's hotkey, its historical hotkey, and its repetition rate. The hotkey is based on the chunk's citation count and last access time, and to prevent system jitter a historical hotkey, governed by a historical control parameter, is introduced. The threshold then determines whether a chunk undergoes similarity detection.

After deduplication, chunks with zero citations still exist in the system and need to be cleaned up. A previously set ordinal parameter determines whether a chunk has zero citations: when the ordinal parameter equals zero, the chunk is no longer cited and can be removed. Since removing a chunk can itself cause system jitter, the time parameter and historical parameter introduced above are applied again, preventing jitter and avoiding the deletion of currently cited data.

The solution is implemented in combination with the distributed platform Hadoop. Experiments show that after the Bloom filter is introduced, the average throughput over four tests rises from 756.3 MB/s to 832.5 MB/s, an increase of 10.08%. Comparing deduplication rates and throughput with DDFS Indexing and Extreme Binning, the deduplication rate of Adj-Dedup is higher than that of Extreme Binning and slightly lower than that of DDFS Indexing, while the throughput of Adj-Dedup exceeds 800 MB/s, versus about 500 MB/s for Extreme Binning and less than 200 MB/s for DDFS Indexing. The deduplication solution designed in this thesis therefore delivers higher system performance.
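The Bloom filter step described above can be sketched as follows. This is a minimal illustrative sketch, not the thesis's implementation: the filter size, the number of hash functions, and the method of deriving bit positions from salted SHA-1 digests are all assumptions made here for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over chunk fingerprints (illustrative sketch)."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)  # all bits start cleared

    def _positions(self, item: bytes):
        # Derive k bit positions by hashing the item with k different salts.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False: the fingerprint is definitely not in the index (no disk access
        # needed). True: probably present, but may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# A 160-bit SHA-1 fingerprint stands in for a chunk's identity.
fp = hashlib.sha1(b"chunk data").digest()
bf = BloomFilter()
bf.add(fp)
```

A negative answer from `might_contain` lets the system skip the on-disk fingerprint index entirely, which is the source of the throughput gain reported in the experiments.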
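The chunk-value evaluation model and the zero-citation cleanup described above could be sketched as follows. The weighting scheme, field names, and grace period are assumptions chosen for illustration; the abstract names the ingredients (citation count, last access time, historical hotkey, repetition rate, ordinal/time parameters) but not their exact formulas.

```python
# Weight of the current hotkey vs. the historical hotkey (assumed value).
ALPHA = 0.7


def hotkey(citations: int, last_access: float, now: float) -> float:
    # A chunk is "hotter" when it has more citations and was accessed recently.
    age = max(now - last_access, 1.0)
    return citations / age


def smoothed_hotkey(current: float, historical: float) -> float:
    # The historical control parameter damps jitter between detection rounds.
    return ALPHA * current + (1.0 - ALPHA) * historical


def should_detect_similarity(chunk: dict, threshold: float, now: float) -> bool:
    # Combine the smoothed hotkey with the chunk's repetition rate and compare
    # the result against the similarity-detection threshold.
    h = smoothed_hotkey(
        hotkey(chunk["citations"], chunk["last_access"], now),
        chunk["hist_hotkey"],
    )
    return h * chunk["repeat_rate"] >= threshold


def can_reclaim(chunk: dict, now: float, grace: float = 3600.0) -> bool:
    # Zero-citation chunks are reclaimed only after a grace period, so a chunk
    # that was cited until recently is not deleted immediately (anti-jitter).
    return chunk["citations"] == 0 and (now - chunk["last_access"]) > grace


hot_chunk = {"citations": 10, "last_access": 90.0,
             "hist_hotkey": 0.5, "repeat_rate": 0.8}
dead_chunk = {"citations": 0, "last_access": 0.0,
              "hist_hotkey": 0.0, "repeat_rate": 0.0}
```

Under this sketch, only chunks whose combined score clears the threshold are submitted to the (expensive) similarity detector, and a zero-citation chunk survives until its grace period expires.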
Keywords/Search Tags:Deduplication, Similarity, DELTA Compression, Data Cleansing