
Design And Research On A High-performance Deduplication System

Posted on: 2014-02-01  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Lu  Full Text: PDF
GTID: 2268330425983934  Subject: Computer technology
Abstract/Summary:
With the development of information technology, data has become one of the key factors determining an enterprise's survival and development. The ever-growing volume of digital information places a critical and mounting demand on large-scale, high-performance data storage. Statistics show that a large amount of duplicate data exists within this growing body of data. Data deduplication technology provides a new way to restrain the excessive growth of data, improve resource utilization, and reduce management costs.

As an emerging data compression technology, data deduplication still faces many problems and challenges. This thesis focuses on deduplication performance, scalability, throughput, and data fragmentation in large-scale backup systems. The main contributions of this thesis are as follows:

To overcome the poor scalability of existing deduplication systems, a distributed-storage deduplication architecture based on centralized management is proposed. The architecture partitions the fingerprint space to index data chunks, which allows the system to extend its index capacity and add storage nodes dynamically on demand, and supports parallel indexing and storage of data chunks, yielding good performance and scalability.

Data chunks stored in a container are organized into ordered layers, and the capacity of each layer grows exponentially. Because the layers are ordered, chunks in each layer can be merged; data fragments are cleaned up during the merge, turning random small disk I/Os into sequential large disk I/Os. This technique not only substantially improves the throughput and storage capacity of a single node, but also applies well to a distributed environment.

Each container has a separate cache, and each file in a container has its own Bloom filter, so the system does not need to maintain a global cache or a global Bloom filter; the memory overhead is dispersed across containers. Because a Bloom filter is generated together with its file, the deletion and persistence problems are solved, which effectively removes the disk bottleneck in a distributed environment.
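To make the per-file Bloom filter idea concrete, the Python sketch below shows one way such a filter over a file's chunk fingerprints could look. The class name, sizing formulas, and hashing scheme are illustrative assumptions, not the implementation described in the thesis.

```python
import hashlib
import math


class FileBloomFilter:
    """Illustrative per-file Bloom filter over chunk fingerprints.

    The thesis only states that each file in a container carries its own
    Bloom filter, built together with the file; the details below are a
    minimal sketch of that idea.
    """

    def __init__(self, expected_chunks: int, fp_rate: float = 0.01):
        # Standard Bloom-filter sizing: m bits, k hash functions.
        self.m = max(8, int(-expected_chunks * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / expected_chunks * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions from the chunk fingerprint itself.
        for i in range(self.k):
            h = hashlib.sha1(fingerprint + i.to_bytes(2, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, fingerprint: bytes) -> bool:
        # False means the chunk is definitely new; True means a lookup in
        # the container's on-disk index is still required.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


# Example: register a chunk fingerprint and query for a possible duplicate.
bf = FileBloomFilter(expected_chunks=10_000)
fp = hashlib.sha1(b"chunk payload").digest()
bf.add(fp)
assert bf.may_contain(fp)
```

Because each filter lives and dies with its file, deleting the file simply discards its filter, which is one way to read the thesis's claim that building the filter with the file resolves the deletion and persistence problems of a single global filter.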
Keywords/Search Tags: Deduplication, Distributed, Scalability, Data fragmentation, Data index