
Design And Research On A High-performance Deduplication System

Posted on: 2014-02-01  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Lu  Full Text: PDF
GTID: 2268330425983934  Subject: Computer technology
Abstract/Summary:
With the development of information technology, data has become one of the key factors determining an enterprise's survival and development. The ever-growing volume of digital information places a critical and mounting demand on large-scale, high-performance data storage. Statistics show that a large amount of duplicate data exists within this growing body of data. Data deduplication technology provides a new way to restrain the excessive growth of data, improve resource utilization, and reduce management costs.

As an emerging data compression technology, data deduplication still faces many problems and challenges. This thesis focuses on deduplication performance, scalability, throughput, and data fragmentation in large-scale backup systems. The main contributions of this thesis are as follows:

To overcome the poor scalability of existing deduplication systems, a distributed-storage deduplication architecture based on centralized management is proposed. The architecture partitions the fingerprint space to index data chunks, which allows the system to extend its index capacity and add storage nodes dynamically on demand, and supports parallel indexing and storage of data chunks, yielding good performance and scalability.

Data chunks stored in a container are organized into ordered layers, and the capacity of each layer grows exponentially. Because the layers are ordered, chunks in each layer can be merged; data fragments are cleaned up during the merge, turning random small disk I/Os into sequential large disk I/Os. This technique not only substantially improves the throughput and storage capacity of a single node, but also applies well to a distributed environment.

Each container has a separate cache, and each file in a container has its own Bloom filter, so the system does not need to maintain a global cache or a global Bloom filter; the memory overhead is dispersed across containers. Because a Bloom filter is generated together with its file, the deletion and persistence problems are solved, which effectively removes the disk bottleneck in a distributed environment.
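To make the per-file Bloom filter idea concrete, the Python sketch below shows one way such a filter over a file's chunk fingerprints could look. The class name, sizing formulas, and hashing scheme are illustrative assumptions, not the implementation described in the thesis.

```python
import hashlib
import math


class FileBloomFilter:
    """Illustrative per-file Bloom filter over chunk fingerprints.

    The thesis only states that each file in a container carries its own
    Bloom filter, built together with the file; the details below are a
    minimal sketch of that idea.
    """

    def __init__(self, expected_chunks: int, fp_rate: float = 0.01):
        # Standard Bloom-filter sizing: m bits, k hash functions.
        self.m = max(8, int(-expected_chunks * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / expected_chunks * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions from the chunk fingerprint itself.
        for i in range(self.k):
            h = hashlib.sha1(fingerprint + i.to_bytes(2, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, fingerprint: bytes) -> bool:
        # False means the chunk is definitely new; True means a lookup in
        # the container's on-disk index is still required.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


# Example: register a chunk fingerprint and query for a possible duplicate.
bf = FileBloomFilter(expected_chunks=10_000)
fp = hashlib.sha1(b"chunk payload").digest()
bf.add(fp)
assert bf.may_contain(fp)
```

Because each filter lives and dies with its file, deleting the file simply discards its filter, which is one way to read the thesis's claim that building the filter with the file resolves the deletion and persistence problems of a single global filter.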
Keywords/Search Tags: Deduplication, Distributed, Scalability, Data fragmentation, Data index