
Research And Application Of De-duplication Algorithm Based On Double Bloom Filter

Posted on: 2014-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Xi
Full Text: PDF
GTID: 2268330425484241
Subject: Software engineering

Abstract/Summary:
With the advance of digitization, the amount of information is growing explosively and data occupies more and more storage space. Studies have found that centralized backup storage systems contain large amounts of redundant data, in some cases more than 60%. Designing efficient data de-duplication algorithms to eliminate this redundancy and save storage space has therefore become a hot research topic. Among existing de-duplication algorithms, file-level de-duplication is fast, but its granularity is too coarse to meet high storage-utilization requirements; block-level de-duplication achieves a better de-duplication ratio, but splitting files into large numbers of blocks greatly increases the time cost. We attempt to overcome these problems by combining the best aspects of both algorithms in a hierarchical de-duplication structure. The specific work is as follows:

(1) Since both file-level and block-level de-duplication have drawbacks, we propose a de-duplication algorithm based on a double Bloom filter. The algorithm uses two Bloom filters to form a two-level de-duplication structure, decomposing the de-duplication process into two parts: de-duplication of files and de-duplication of data blocks. The algorithm first performs file-level de-duplication at level one; files judged non-duplicate at the file level are then divided into blocks and de-duplicated at the block level. With this hierarchical structure, duplicate files are eliminated directly at level one without ever entering block-level de-duplication, reducing the block-level workload, while level two still provides block-level de-duplication granularity. Experiments show that the algorithm achieves a better de-duplication ratio and a shorter computation time while maintaining a low false positive rate (a characteristic of Bloom filters: a non-duplicate file may be mistaken for a duplicate one).

(2) This paper applies the algorithm to a document backup system to realize de-duplication. Because the algorithm uses Bloom filters, misjudgments (false positive errors) can occur, so the metadata (which stores the data's information) must be queried to verify each "duplicate" decision. The metadata is preread into memory, and when the algorithm judges a piece of data to be a duplicate it first searches in memory; if the fingerprint is not found there, the index file is queried. Through this three-level query the false positive error is eliminated: data whose fingerprint is not matched in the second stage is not a duplicate, and storing that data and its related information constitutes the false positive error correction. A simplified sketch of the two-level filter structure follows.
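The following is a minimal Python sketch of the two-level idea, not the thesis's actual implementation: the Bloom filter parameters, the SHA-256 fingerprints, the fixed-size chunking (the thesis uses a CDC, i.e. content-defined chunking, algorithm per the keywords), and all names such as BloomFilter and dedup are illustrative assumptions.

    import hashlib

    class BloomFilter:
        """Simple Bloom filter: k hash functions over an m-bit array."""
        def __init__(self, m=1 << 20, k=4):
            self.m, self.k, self.bits = m, k, bytearray(m // 8)

        def _positions(self, data: bytes):
            for i in range(self.k):
                h = hashlib.sha256(i.to_bytes(2, "big") + data).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, data: bytes):
            for p in self._positions(data):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, data: bytes) -> bool:
            # False positives are possible; false negatives are not.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(data))

    file_filter = BloomFilter()   # level 1: whole-file fingerprints
    block_filter = BloomFilter()  # level 2: data-block fingerprints

    def dedup(path: str, block_size: int = 4096):
        """Return the list of new (non-duplicate) blocks for one file."""
        data = open(path, "rb").read()
        fp = hashlib.sha256(data).digest()
        if file_filter.might_contain(fp):
            # Duplicate file: level 2 is skipped entirely. In the full
            # system this verdict is verified against metadata (part (2)).
            return []
        file_filter.add(fp)
        new_blocks = []
        # Fixed-size chunking for brevity; the thesis uses CDC chunking.
        for off in range(0, len(data), block_size):
            block = data[off:off + block_size]
            bfp = hashlib.sha256(block).digest()
            if not block_filter.might_contain(bfp):
                block_filter.add(bfp)
                new_blocks.append(block)
        return new_blocks

In the full system, a "duplicate" verdict from either filter is only provisional and is verified against the metadata, as described in part (2).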
During the query process, metadata prereading is used to reduce disk I/O and raise de-duplication efficiency.

(3) Because false positive error correction requires a large number of metadata accesses, which increases the time overhead, this paper designs index-file caching of metadata operations to keep them efficient and improve de-duplication performance. However, the operating system's handle limit restricts how many files can be open simultaneously; once the limit is reached, each index file must be repeatedly opened, written, and closed, and the cost and time of de-duplication rise sharply. To confront this situation, we use a dynamic hash algorithm to place the metadata, keeping it in a small number of index files, which overcomes the handle limitation (see the sketch below).

The experimental results show that, in the document backup system, when the de-duplication algorithm is applied to document collections with different repetition rates and compared with two other backup tools, it maintains a good de-duplication rate while shortening the average time by 10%.
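To illustrate point (3), here is a minimal sketch, again an illustration under stated assumptions rather than the thesis implementation, of hashing fingerprints into a small fixed set of index files so that all of them can stay open within the OS handle limit, with each partition preread into memory on first access. The 64-file count, the 32-byte SHA-256 fingerprints, and the class and method names are all assumptions.

    import os

    NUM_INDEX_FILES = 64  # illustrative: well under the OS open-file limit

    class MetadataStore:
        """Hash-partitioned metadata: fingerprints are spread over a fixed,
        small set of index files so all of them can stay open at once."""
        def __init__(self, root: str):
            os.makedirs(root, exist_ok=True)
            self.files = [open(os.path.join(root, f"index_{i:02d}.idx"), "a+b")
                          for i in range(NUM_INDEX_FILES)]
            self.cache = [None] * NUM_INDEX_FILES  # preread fingerprint sets

        def _bucket(self, fp: bytes) -> int:
            return fp[0] % NUM_INDEX_FILES  # hash-based placement

        def _load(self, i: int) -> set:
            if self.cache[i] is None:  # preread once to cut disk I/O
                f = self.files[i]
                f.seek(0)
                raw = f.read()
                self.cache[i] = {raw[o:o + 32] for o in range(0, len(raw), 32)}
            return self.cache[i]

        def verify_duplicate(self, fp: bytes) -> bool:
            """Final check behind the Bloom filters: a miss here means the
            Bloom 'duplicate' verdict was a false positive."""
            if fp in self._load(self._bucket(fp)):
                return True
            self.add(fp)  # false positive: store the data as new
            return False

        def add(self, fp: bytes):
            i = self._bucket(fp)
            self._load(i).add(fp)
            self.files[i].write(fp)  # append the 32-byte fingerprint

In the full system, verify_duplicate would sit behind the Bloom-filter hit path of the previous sketch, so the index files are consulted only when a Bloom filter reports a possible duplicate.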
Keywords/Search Tags: Data de-duplication, Bloom filter, CDC algorithm, false positive rate, metadata preread, handle limitation