
Research And Design Of Data Deduplication System For Distributed Cluster In Cloud Storage

Posted on: 2015-10-16    Degree: Master    Type: Thesis
Country: China    Candidate: Z R Li    Full Text: PDF
GTID: 2308330473453715    Subject: Computer software and theory
Abstract/Summary:
As the age of Big Data arrives, data volumes are exploding, and massive data storage has become the primary problem facing data centers. Large amounts of duplicate data are found throughout information processing and storage in many application scenarios, such as file systems, email attachments, web objects, operating systems, and application software. Traditional data-protection technologies, such as periodic backup, version control, and snapshots, also increase the amount of duplicate data, and thus consume additional network bandwidth and storage resources. To improve the utilization of storage resources and reduce the cost of data management, data deduplication has become a research focus for enterprises and data centers.

Cloud storage offers high reliability, high universality, high scalability, and large capacity, so research on cloud storage keeps pace with the development of computer technology and has high application value. Building a large-scale, high-performance distributed deduplication system in the cloud brings many advantages alongside its challenges. In this thesis, we design an online cluster deduplication system architecture and carry out extensive research on the data routing strategy and on the optimization of index queries. The contributions of this thesis are as follows.

(1) Based on HDFS, we design H-Dedup, a distributed file system with data deduplication. According to the characteristics of deduplication technology, we design the system structure and divide the software functions into modules, so that deduplication can be better applied in a cluster storage architecture.

(2) For deduplication, we design a similarity-based routing algorithm for distributing data. The routing unit is a superblock; based on similarity theory, we sample each superblock and choose a small number of fingerprints to represent it. With a stateful routing strategy, we match these representative fingerprints to quickly locate the storage node for a superblock, reducing network bandwidth consumption. In this way, superblocks can be distributed across storage nodes to obtain a high deduplication ratio while maintaining high storage performance and throughput (a sketch of such a routing scheme appears below).

(3) To alleviate the disk bottleneck in index queries, we design an in-memory similarity index table that performs partial deduplication and reduces random disk reads and writes. Exploiting data locality, we design a global LRU cache to reduce disk accesses. In addition, we design an index of hot fingerprints based on the frequency of access to containers, which increases the deduplication ratio within a single node (a sketch of such a fingerprint cache appears below).
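The abstract does not specify the routing algorithm in detail; the following Python sketch illustrates one plausible form of similarity-based, stateful superblock routing under assumptions of my own: each superblock is represented by its k smallest chunk fingerprints, and the superblock is sent to the node whose previously seen fingerprints overlap most with those representatives. The function names (route_superblock, representative_fingerprints) and the sampling rule are illustrative, not taken from the thesis.

    import hashlib

    def chunk_fingerprints(superblock_chunks):
        """Compute a fingerprint (SHA-1 digest) for every chunk in a superblock."""
        return [hashlib.sha1(chunk).hexdigest() for chunk in superblock_chunks]

    def representative_fingerprints(fingerprints, k=8):
        """Sample the superblock: keep its k smallest fingerprints as representatives."""
        return sorted(fingerprints)[:k]

    def route_superblock(superblock_chunks, node_fingerprint_sets):
        """Stateful routing: send the superblock to the node whose stored fingerprints
        overlap most with the superblock's representatives; fall back to the node
        holding the fewest fingerprints (a simple load-balance rule) when no node matches."""
        reps = set(representative_fingerprints(chunk_fingerprints(superblock_chunks)))
        best_node, best_overlap = None, 0
        for node, seen in node_fingerprint_sets.items():
            overlap = len(reps & seen)
            if overlap > best_overlap:
                best_node, best_overlap = node, overlap
        if best_node is None:
            best_node = min(node_fingerprint_sets,
                            key=lambda n: len(node_fingerprint_sets[n]))
        node_fingerprint_sets[best_node].update(reps)  # update the routing state
        return best_node

Because only a few representative fingerprints per superblock are compared and cached as routing state, lookups stay cheap and similar superblocks tend to land on the same node, which is what allows a high deduplication ratio without a full global index.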
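Likewise, the in-memory structures of contribution (3) can be pictured with a small sketch. The design below is a minimal sketch under my own assumptions: a global LRU cache of container fingerprint sets (built on collections.OrderedDict) combined with a hot-fingerprint index promoted by container access counts; class and parameter names (FingerprintCache, hot_threshold) are hypothetical.

    from collections import OrderedDict, Counter

    class FingerprintCache:
        """Global LRU cache of container fingerprint sets plus a hot-fingerprint index."""

        def __init__(self, capacity=1024, hot_threshold=4):
            self.lru = OrderedDict()        # container_id -> set of fingerprints, in LRU order
            self.capacity = capacity
            self.access_count = Counter()   # how often each cached container has been hit
            self.hot_threshold = hot_threshold
            self.hot_index = {}             # fingerprint -> container_id, hot containers only

        def lookup(self, fingerprint):
            """Return the cached container holding this fingerprint, or None
            (meaning an on-disk index lookup is still required)."""
            if fingerprint in self.hot_index:
                return self.hot_index[fingerprint]
            for container_id, fps in self.lru.items():
                if fingerprint in fps:
                    self._touch(container_id)
                    return container_id
            return None

        def insert(self, container_id, fingerprints):
            """Cache a container's fingerprints after a disk read, evicting the LRU entry."""
            self.lru[container_id] = set(fingerprints)
            if len(self.lru) > self.capacity:
                evicted_id, evicted_fps = self.lru.popitem(last=False)
                self.access_count.pop(evicted_id, None)
                for fp in evicted_fps:                     # drop stale hot-index entries
                    if self.hot_index.get(fp) == evicted_id:
                        del self.hot_index[fp]
            self._touch(container_id)

        def _touch(self, container_id):
            """Record an access; promote frequently accessed containers to the hot index."""
            self.lru.move_to_end(container_id)
            self.access_count[container_id] += 1
            if self.access_count[container_id] >= self.hot_threshold:
                for fp in self.lru[container_id]:
                    self.hot_index[fp] = container_id

The intent mirrored here is that most fingerprint lookups are answered from memory, either from the hot index (frequently accessed containers) or from recently used containers kept by the LRU policy, so random disk reads during index queries are reduced.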
Keywords/Search Tags: Cloud storage, Data deduplication, Data redundancy, Cluster storage, Distributed file system