
Research On Optimization Methods To Data De-duplication For Archive Storage

Posted on: 2014-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: S J Han
Full Text: PDF
GTID: 2268330422463470
Subject: Computer system architecture
Abstract/Summary:
With the advance of social informatization, data is becoming increasingly important, and the storage demands of enterprise data centers are growing explosively. Current storage systems are designed mainly for data read/write performance and reliability, ignoring the associations among data and its redundancy characteristics. This not only wastes storage space but also makes it difficult for users to effectively manage large volumes of complexly structured data. To address this, data de-duplication technology has been proposed in recent years.

On the basis of analyzing the characteristics of metadata access and querying, data layout, and data reading and writing, this thesis proposes a system architecture in which metadata access and data access are separated from each other: (1) it uses a tripartite structure consisting of a client, a metadata server, and storage nodes; (2) metadata is exchanged between the client and the metadata server, while file contents are exchanged between the client and the storage nodes, giving the scheme high scalability and high access concurrency. For the de-duplication function: (1) a fixed-size block partitioning method is used, and the MD5 and SHA-1 hash algorithms are used to compute the fingerprint of each data block; (2) a two-layer Bloom filter is used for rapid identification and filtering of data-block hash fingerprints, and a B+-tree index structure serves as the persistent storage solution for fingerprint metadata.
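The fixed-block partitioning and fingerprint-filtering pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the block size, the Bloom filter parameters, the way the two layers are organized, and the combined MD5+SHA-1 fingerprint format are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; the abstract does not state the actual value


class BloomFilter:
    """Minimal Bloom filter using double hashing (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from one SHA-1 digest via double hashing.
        digest = hashlib.sha1(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely new"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def fingerprint(block):
    # MD5 and SHA-1 digests combined, as the abstract describes.
    return hashlib.md5(block).digest() + hashlib.sha1(block).digest()


def deduplicate(blocks, layer1, layer2, store):
    """Only fingerprints that pass both filter layers are checked against
    the (persistent) fingerprint store; filter misses are definitely new."""
    unique = []
    for block in blocks:
        fp = fingerprint(block)
        if layer1.might_contain(fp) and layer2.might_contain(fp) and fp in store:
            continue  # duplicate block: already stored, skip it
        layer1.add(fp)
        layer2.add(fp)
        store.add(fp)
        unique.append(block)
    return unique
```

In the thesis's design the store behind the filters is a B+-tree index of fingerprint metadata; a plain `set` stands in for it here.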
To further optimize I/O performance: (1) a data layout strategy that stores data by sub-region according to the data flow is adopted, which exploits the spatial locality of data access; (2) combined with client-side metadata and data caching mechanisms, the cache hit rate of file access and file read/write performance are improved.

Finally, a de-duplication system prototype with the tripartite framework is designed and implemented, and its functionality and performance are tested on top of the prototype. Functional test results show that the de-duplication scheme achieves a data compression ratio of up to 130% on a data set of virtual machine images; performance test results show that the caching mechanism improves file access performance; and fingerprint filtering statistics show that the two-layer Bloom filter achieves a high fingerprint filtering rate with an actual false positive rate of 0.071%, which is within the allowed range of the theoretical false positive rate of 0.1%.
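The comparison of the measured 0.071% false positive rate against the 0.1% theoretical bound follows from the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k, for m bits, n inserted elements, and k hash functions. A small sketch of that formula (the parameter values below are illustrative, not the thesis's actual configuration):

```python
import math


def bloom_false_positive_rate(m, n, k):
    """Approximate false-positive probability of a Bloom filter with
    m bits, n inserted elements, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_num_hashes(m, n):
    """The k that minimizes the false-positive rate: k = (m/n) * ln 2."""
    return max(1, round((m / n) * math.log(2)))
```

For a target rate of 0.1%, this formula requires roughly 14.4 bits per stored fingerprint with about 10 hash functions; a measured rate below the target, as reported above, indicates the filters were provisioned at or beyond that budget.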
Keywords/Search Tags: De-duplication, Distributed storage, Hash fingerprint filtering, Metadata organization, Data layout