
Research On Data Deduplication Technology In Network Storage System

Posted on: 2013-06-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z D Zhou    Full Text: PDF
GTID: 1228330392957284    Subject: Computer system architecture
Abstract/Summary:
Data deduplication is a lossless data compression technique for network storage systems that limits the excessive growth of data storage overhead and reduces the cost of system construction and operation. As the volume of digital data grows explosively, deduplication has attracted great interest in both academia and industry. However, many technical issues remain open in this area, such as improving the data compression rate, processing time, and data reliability. This dissertation therefore studies data deduplication in network storage systems from three aspects: processing efficiency, data reliability, and data distribution. The main contributions of this dissertation are as follows.

First, based on theoretical analysis with a mathematical model and experimental results on real-world datasets, the factors influencing deduplication performance are examined in detail from the perspectives of both data duplication characteristics and processing method. Guided by these results, a duplication-characteristics-aware deduplication framework is proposed. The framework employs two techniques: Semantic-aware Data Grouping and the Progressive Data Chunking Setting Method. Semantic-aware Data Grouping is a pre-processing step that groups data by semantic information, allowing the framework to improve performance by exploiting the duplication characteristics of each group. The Progressive Data Chunking Setting Method then obtains an optimal data-chunking setting for each specific data group. Experiments show that the proposed framework improves both the data compression rate and the processing time.

Second, while data deduplication saves storage space effectively, it also reduces data reliability, since a single lost chunk can affect every file that references it. An optimal redundancy model is proposed to obtain the optimal replication degree for each individual data object.
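To make the idea of per-object replication degrees concrete, the sketch below derives a replica count from a chunk's popularity (the number of files referencing it), assuming independent node failures with probability `p_fail` and a base reliability target. The formula, function name, and parameters are illustrative assumptions, not the dissertation's actual model.

```python
import math

def replication_degree(popularity, p_fail=0.01, base_target=0.999):
    """Smallest replica count d with 1 - p_fail**d >= target.

    Chunks referenced by more files get a stricter reliability target,
    so popular chunks receive more replicas (illustrative policy only).
    """
    # Tighten the loss-probability budget in proportion to popularity.
    target = 1 - (1 - base_target) / popularity
    # With independent failures, d replicas survive with prob. 1 - p_fail**d.
    return math.ceil(math.log(1 - target) / math.log(p_fail))
```

Under these assumptions a chunk referenced once needs fewer replicas than one referenced by a hundred files, which is the kind of differentiated redundancy the popularity-based scheme aims at.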
Furthermore, the Popularity-based Data Redundancy Scheme is presented, which applies the proposed model to achieve an optimal level of data reliability at minimum storage overhead. To improve feasibility, several acceleration techniques are incorporated into the scheme, which use sampling and empirical intermediate data to reduce the computational cost of the model. Evaluation experiments demonstrate that the proposed scheme improves data reliability for deduplication solutions.

Third, since traditional data distribution schemes struggle to provide flexibility and load balancing in heterogeneous distributed storage environments, the Capacity-Aware Data Distribution Method is proposed. First, a consistent-hashing-based capacity-aware data distribution scheme is presented, which employs a virtual-node allocation method and a capacity-aware allocation method to improve flexibility and load balancing. Second, a high-reliability capacity-aware data distribution scheme is presented, which supports a multi-replica data redundancy scheme. Experiments show that the two proposed distribution schemes improve load balancing in heterogeneous distributed storage environments.
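A minimal sketch of a capacity-aware consistent-hashing ring, in which each node receives a number of virtual nodes proportional to its capacity. The class and parameter names are hypothetical; a real scheme would add replica placement and failure handling on top of this.

```python
import bisect
import hashlib

class CapacityAwareRing:
    """Consistent-hashing ring where virtual-node count scales with capacity."""

    def __init__(self, vnodes_per_unit=100):
        self.vnodes_per_unit = vnodes_per_unit
        self._keys = []    # sorted virtual-node hashes
        self._nodes = []   # node id at the same index as its hash

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node, capacity):
        # A node of capacity c gets c * vnodes_per_unit ring positions,
        # so higher-capacity nodes own proportionally more of the ring.
        for i in range(int(capacity * self.vnodes_per_unit)):
            h = self._hash(f"{node}#{i}")
            idx = bisect.bisect(self._keys, h)
            self._keys.insert(idx, h)
            self._nodes.insert(idx, node)

    def locate(self, key):
        # Map the key clockwise to the first virtual node at or after its hash.
        h = self._hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._nodes[idx]
```

Distributing many keys over nodes of capacities 1, 2, and 4 should then yield loads roughly in a 1:2:4 ratio, which is the load-balancing property a capacity-aware allocation method targets; adding or removing a node only remaps the keys adjacent to its virtual nodes, which is the flexibility argument for consistent hashing.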
Keywords/Search Tags: Network Storage, Data Deduplication, Data Redundancy, Data Reliability, Data Distribution