
Research On Optimization Methods To Data De-duplication For Archive Storage

Posted on: 2014-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: S J Han
Full Text: PDF
GTID: 2268330422463470
Subject: Computer system architecture
Abstract/Summary:
With the advance of social informatization, data is becoming increasingly important, and the storage demands of enterprise data centers are growing explosively. Current storage systems are designed mainly for data read/write performance and reliability, ignoring the associations among data and its redundancy characteristics. This not only wastes storage space but also makes it difficult for users to effectively manage large volumes of complexly structured data. To address this, data de-duplication technology has been proposed in recent years.

On the basis of analyzing the characteristics of metadata access and querying, data layout, and data reading and writing, this thesis proposes a system architecture in which metadata access and data access are separated from each other: (1) it uses a tripartite structure consisting of a client, a metadata server, and storage nodes; (2) metadata is exchanged between the client and the metadata server, while file contents are exchanged between the client and the storage nodes, giving the scheme high scalability and high access concurrency. For the de-duplication function: (1) a fixed-size block partitioning method is used, and the MD5 and SHA-1 hash algorithms are used to compute the fingerprint of each data block; (2) a two-layer Bloom filter is used for rapid identification and filtering of data-block hash fingerprints, and a B+-tree index structure serves as the persistent storage solution for fingerprint metadata.
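The fixed-block partitioning and fingerprint-filtering pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the block size, the Bloom filter parameters, the way the two layers are organized, and the combined MD5+SHA-1 fingerprint format are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; the abstract does not state the actual value


class BloomFilter:
    """Minimal Bloom filter using double hashing (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from one SHA-1 digest via double hashing.
        digest = hashlib.sha1(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely new"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def fingerprint(block):
    # MD5 and SHA-1 digests combined, as the abstract describes.
    return hashlib.md5(block).digest() + hashlib.sha1(block).digest()


def deduplicate(blocks, layer1, layer2, store):
    """Only fingerprints that pass both filter layers are checked against
    the (persistent) fingerprint store; filter misses are definitely new."""
    unique = []
    for block in blocks:
        fp = fingerprint(block)
        if layer1.might_contain(fp) and layer2.might_contain(fp) and fp in store:
            continue  # duplicate block: already stored, skip it
        layer1.add(fp)
        layer2.add(fp)
        store.add(fp)
        unique.append(block)
    return unique
```

In the thesis's design the store behind the filters is a B+-tree index of fingerprint metadata; a plain `set` stands in for it here.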
To further optimize I/O performance: (1) a data layout strategy that stores data by sub-region according to the data flow is adopted, which exploits the spatial locality of data access; (2) combined with client-side metadata and data caching mechanisms, the cache hit rate of file access and file read/write performance are improved.

Finally, a de-duplication system prototype with the tripartite framework is designed and implemented, and its functionality and performance are tested on top of the prototype. Functional test results show that the de-duplication scheme achieves a data compression ratio of up to 130% on a data set of virtual machine images; performance test results show that the caching mechanism improves file access performance; and fingerprint filtering statistics show that the two-layer Bloom filter achieves a high fingerprint filtering rate with an actual false positive rate of 0.071%, which is within the allowed range of the theoretical false positive rate of 0.1%.
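The comparison of the measured 0.071% false positive rate against the 0.1% theoretical bound follows from the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k, for m bits, n inserted elements, and k hash functions. A small sketch of that formula (the parameter values below are illustrative, not the thesis's actual configuration):

```python
import math


def bloom_false_positive_rate(m, n, k):
    """Approximate false-positive probability of a Bloom filter with
    m bits, n inserted elements, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_num_hashes(m, n):
    """The k that minimizes the false-positive rate: k = (m/n) * ln 2."""
    return max(1, round((m / n) * math.log(2)))
```

For a target rate of 0.1%, this formula requires roughly 14.4 bits per stored fingerprint with about 10 hash functions; a measured rate below the target, as reported above, indicates the filters were provisioned at or beyond that budget.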
Keywords/Search Tags: De-duplication, Distributed storage, Hash fingerprint filtering, Metadata organization, Data layout