
Research And Design Of Data Deduplication System For Distributed Cluster In Cloud Storage

Posted on: 2015-10-16    Degree: Master    Type: Thesis
Country: China    Candidate: Z R Li    Full Text: PDF
GTID: 2308330473453715    Subject: Computer software and theory
Abstract/Summary:
As the age of Big Data arrives, data volumes are exploding, and massive data storage has become the primary problem facing data centers. Large amounts of duplicate data are found throughout information processing and storage in many application scenarios, such as file systems, email attachments, web objects, operating systems, and application software. Traditional data-protection technologies, such as periodic backup, version control, and snapshots, also increase the amount of duplicate data, and thus consume additional network bandwidth and storage resources. To improve the utilization of storage resources and reduce the cost of data management, data deduplication has become a research focus for enterprises and data centers.

Cloud storage offers high reliability, high universality, high scalability, and large capacity, so research on cloud storage keeps pace with the development of computer technology and has high application value. Building a large-scale, high-performance distributed deduplication system in the cloud brings many advantages alongside its challenges. In this thesis, we design an online cluster deduplication system architecture and carry out extensive research on the data routing strategy and on the optimization of index queries. The contributions of this thesis are as follows.

(1) Based on HDFS, we design H-Dedup, a distributed file system with data deduplication. According to the characteristics of deduplication technology, we design the system structure and divide the software functions into modules, so that deduplication can be better applied in a cluster storage architecture.

(2) For deduplication, we design a similarity-based routing algorithm for distributing data. The routing unit is a superblock; based on similarity theory, we sample each superblock and choose a small number of fingerprints to represent it. With a stateful routing strategy, we match these representative fingerprints to quickly locate the storage node for a superblock, reducing network bandwidth consumption. In this way, superblocks can be distributed across storage nodes to obtain a high deduplication ratio while maintaining high storage performance and throughput (a sketch of such a routing scheme appears below).

(3) To alleviate the disk bottleneck in index queries, we design an in-memory similarity index table that performs partial deduplication and reduces random disk reads and writes. Exploiting data locality, we design a global LRU cache to reduce disk accesses. In addition, we design an index of hot fingerprints based on the frequency of access to containers, which increases the deduplication ratio within a single node (a sketch of such a fingerprint cache appears below).
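The abstract does not specify the routing algorithm in detail; the following Python sketch illustrates one plausible form of similarity-based, stateful superblock routing under assumptions of my own: each superblock is represented by its k smallest chunk fingerprints, and the superblock is sent to the node whose previously seen fingerprints overlap most with those representatives. The function names (route_superblock, representative_fingerprints) and the sampling rule are illustrative, not taken from the thesis.

    import hashlib

    def chunk_fingerprints(superblock_chunks):
        """Compute a fingerprint (SHA-1 digest) for every chunk in a superblock."""
        return [hashlib.sha1(chunk).hexdigest() for chunk in superblock_chunks]

    def representative_fingerprints(fingerprints, k=8):
        """Sample the superblock: keep its k smallest fingerprints as representatives."""
        return sorted(fingerprints)[:k]

    def route_superblock(superblock_chunks, node_fingerprint_sets):
        """Stateful routing: send the superblock to the node whose stored fingerprints
        overlap most with the superblock's representatives; fall back to the node
        holding the fewest fingerprints (a simple load-balance rule) when no node matches."""
        reps = set(representative_fingerprints(chunk_fingerprints(superblock_chunks)))
        best_node, best_overlap = None, 0
        for node, seen in node_fingerprint_sets.items():
            overlap = len(reps & seen)
            if overlap > best_overlap:
                best_node, best_overlap = node, overlap
        if best_node is None:
            best_node = min(node_fingerprint_sets,
                            key=lambda n: len(node_fingerprint_sets[n]))
        node_fingerprint_sets[best_node].update(reps)  # update the routing state
        return best_node

Because only a few representative fingerprints per superblock are compared and cached as routing state, lookups stay cheap and similar superblocks tend to land on the same node, which is what allows a high deduplication ratio without a full global index.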
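Likewise, the in-memory structures of contribution (3) can be pictured with a small sketch. The design below is a minimal sketch under my own assumptions: a global LRU cache of container fingerprint sets (built on collections.OrderedDict) combined with a hot-fingerprint index promoted by container access counts; class and parameter names (FingerprintCache, hot_threshold) are hypothetical.

    from collections import OrderedDict, Counter

    class FingerprintCache:
        """Global LRU cache of container fingerprint sets plus a hot-fingerprint index."""

        def __init__(self, capacity=1024, hot_threshold=4):
            self.lru = OrderedDict()        # container_id -> set of fingerprints, in LRU order
            self.capacity = capacity
            self.access_count = Counter()   # how often each cached container has been hit
            self.hot_threshold = hot_threshold
            self.hot_index = {}             # fingerprint -> container_id, hot containers only

        def lookup(self, fingerprint):
            """Return the cached container holding this fingerprint, or None
            (meaning an on-disk index lookup is still required)."""
            if fingerprint in self.hot_index:
                return self.hot_index[fingerprint]
            for container_id, fps in self.lru.items():
                if fingerprint in fps:
                    self._touch(container_id)
                    return container_id
            return None

        def insert(self, container_id, fingerprints):
            """Cache a container's fingerprints after a disk read, evicting the LRU entry."""
            self.lru[container_id] = set(fingerprints)
            if len(self.lru) > self.capacity:
                evicted_id, evicted_fps = self.lru.popitem(last=False)
                self.access_count.pop(evicted_id, None)
                for fp in evicted_fps:                     # drop stale hot-index entries
                    if self.hot_index.get(fp) == evicted_id:
                        del self.hot_index[fp]
            self._touch(container_id)

        def _touch(self, container_id):
            """Record an access; promote frequently accessed containers to the hot index."""
            self.lru.move_to_end(container_id)
            self.access_count[container_id] += 1
            if self.access_count[container_id] >= self.hot_threshold:
                for fp in self.lru[container_id]:
                    self.hot_index[fp] = container_id

The intent mirrored here is that most fingerprint lookups are answered from memory, either from the hot index (frequently accessed containers) or from recently used containers kept by the LRU policy, so random disk reads during index queries are reduced.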
Keywords/Search Tags: Cloud storage, Data deduplication, Data redundancy, Cluster storage, Distributed file system