
Research On Data Deduplication Technology In Network Storage System

Posted on: 2013-06-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z D Zhou    Full Text: PDF
GTID: 1228330392957284    Subject: Computer system architecture
Abstract/Summary:
Data deduplication is a lossless data compression technique for network storage systems that limits the excessive growth of data storage overhead and reduces the cost of system construction and operation. As the volume of digital data grows explosively, deduplication has attracted great interest in both academia and industry. However, many technical issues remain open in this area, such as improving the data compression rate, processing time, and data reliability. This dissertation therefore studies data deduplication in network storage systems from three aspects: processing efficiency, data reliability, and data distribution. The main contributions of this dissertation are as follows.

First, based on theoretical analysis with a mathematical model and experimental results on real-world datasets, the factors influencing deduplication performance are examined in detail from the perspectives of both data duplication characteristics and processing method. Guided by these results, a duplication-characteristics-aware deduplication framework is proposed. The framework employs two techniques: Semantic-aware Data Grouping and the Progressive Data Chunking Setting Method. Semantic-aware Data Grouping is a pre-processing step that groups data by semantic information, allowing the framework to improve performance by exploiting the duplication characteristics of each group. The Progressive Data Chunking Setting Method then obtains an optimal data-chunking setting for each specific data group. Experiments show that the proposed framework improves both the data compression rate and the processing time.

Second, while data deduplication saves storage space effectively, it also reduces data reliability, since a single lost chunk can affect every file that references it. An optimal redundancy model is proposed to obtain the optimal replication degree for each individual data object.
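To make the idea of per-object replication degrees concrete, the sketch below derives a replica count from a chunk's popularity (the number of files referencing it), assuming independent node failures with probability `p_fail` and a base reliability target. The formula, function name, and parameters are illustrative assumptions, not the dissertation's actual model.

```python
import math

def replication_degree(popularity, p_fail=0.01, base_target=0.999):
    """Smallest replica count d with 1 - p_fail**d >= target.

    Chunks referenced by more files get a stricter reliability target,
    so popular chunks receive more replicas (illustrative policy only).
    """
    # Tighten the loss-probability budget in proportion to popularity.
    target = 1 - (1 - base_target) / popularity
    # With independent failures, d replicas survive with prob. 1 - p_fail**d.
    return math.ceil(math.log(1 - target) / math.log(p_fail))
```

Under these assumptions a chunk referenced once needs fewer replicas than one referenced by a hundred files, which is the kind of differentiated redundancy the popularity-based scheme aims at.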
Furthermore, the Popularity-based Data Redundancy Scheme is presented, which applies the proposed model to achieve an optimal level of data reliability at minimum storage overhead. To improve feasibility, several acceleration techniques are incorporated into the scheme, which use sampling and empirical intermediate data to reduce the computational cost of the model. Evaluation experiments demonstrate that the proposed scheme improves data reliability for deduplication solutions.

Third, since traditional data distribution schemes struggle to provide flexibility and load balancing in heterogeneous distributed storage environments, the Capacity-Aware Data Distribution Method is proposed. First, a consistent-hashing-based capacity-aware data distribution scheme is presented, which employs a virtual-node allocation method and a capacity-aware allocation method to improve flexibility and load balancing. Second, a high-reliability capacity-aware data distribution scheme is presented, which supports a multi-replica data redundancy scheme. Experiments show that the two proposed distribution schemes improve load balancing in heterogeneous distributed storage environments.
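A minimal sketch of a capacity-aware consistent-hashing ring, in which each node receives a number of virtual nodes proportional to its capacity. The class and parameter names are hypothetical; a real scheme would add replica placement and failure handling on top of this.

```python
import bisect
import hashlib

class CapacityAwareRing:
    """Consistent-hashing ring where virtual-node count scales with capacity."""

    def __init__(self, vnodes_per_unit=100):
        self.vnodes_per_unit = vnodes_per_unit
        self._keys = []    # sorted virtual-node hashes
        self._nodes = []   # node id at the same index as its hash

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node, capacity):
        # A node of capacity c gets c * vnodes_per_unit ring positions,
        # so higher-capacity nodes own proportionally more of the ring.
        for i in range(int(capacity * self.vnodes_per_unit)):
            h = self._hash(f"{node}#{i}")
            idx = bisect.bisect(self._keys, h)
            self._keys.insert(idx, h)
            self._nodes.insert(idx, node)

    def locate(self, key):
        # Map the key clockwise to the first virtual node at or after its hash.
        h = self._hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._nodes[idx]
```

Distributing many keys over nodes of capacities 1, 2, and 4 should then yield loads roughly in a 1:2:4 ratio, which is the load-balancing property a capacity-aware allocation method targets; adding or removing a node only remaps the keys adjacent to its virtual nodes, which is the flexibility argument for consistent hashing.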
Keywords/Search Tags: Network Storage, Data Deduplication, Data Redundancy, Data Reliability, Data Distribution