Research On Building Efficient Data Deduplication Storage Systems For Data Backup

Posted on: 2017-03-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Fu
Full Text: PDF
GTID: 1318330485950833
Subject: Computer system architecture
Abstract/Summary:
In the big data era, efficiently storing and managing the huge volume of digital data has become a major challenge for storage researchers and practitioners. Recent work has reported that redundant data is common in various storage systems, such as secondary backup systems and desktop file systems, and eliminating this redundant data can substantially reduce storage cost. Data deduplication, an efficient emerging compression technique, has been increasingly widely deployed in storage systems for cost savings. However, many open problems remain, such as chunk fragmentation, large-scale fingerprint indexing, and storage reliability. Hence, this dissertation first proposes several approaches to address the chunk fragmentation caused by data deduplication, then analyzes the impact of data deduplication on storage reliability, and finally discusses how to build high-performance, cost-effective deduplication systems.

The chunk fragmentation caused by data deduplication leads to poor restore performance, and after backup deletions it also complicates garbage collection. To understand chunk fragmentation, a fragmentation taxonomy is proposed: fragmented chunks come from sparse containers and out-of-order containers. Sparse containers directly amplify read operations, while out-of-order containers hurt restore performance if the restore cache is not sufficiently large. Current deduplication systems employ a rewriting buffer to rewrite fragmented duplicate chunks; due to the limited buffer size, they cannot differentiate sparse containers from out-of-order containers, which leads to a significant increase in storage cost and only a moderate improvement in restore performance. To address sparse containers, this dissertation proposes the novel History-Aware Rewriting algorithm (HAR). The key observation is that consecutive backup streams are similar, so HAR records sparse containers in the current backup and rewrites them in the next backup. To address out-of-order containers, Belady's optimal cache is implemented for restore; with a limited restore cache, the optimal cache outperforms the traditional LRU cache. If self-references are common in backup streams, the restore cache alone cannot handle out-of-order containers, so a hybrid rewriting algorithm is proposed for the best restore performance, and the Cache-Aware Filter is proposed to reduce its additional storage cost. To reduce the overhead of garbage collection, the Container-Marker Algorithm (CMA) is proposed. Under the FIFO backup deletion scheme, CMA leverages HAR to clear sparse containers, making the time-consuming container merging operation unnecessary. CMA keeps track of valid containers instead of chunks, so its overhead is proportional to the number of containers rather than the number of chunks. Experiments with four long-term backup datasets show that the proposed solutions outperform existing solutions in both storage efficiency and restore performance.
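The abstract describes HAR only at a high level. The Python sketch below illustrates the idea under simplifying assumptions: a fixed container size, a hypothetical utilization threshold, and placeholder objects (`fingerprint_index`, `writer`) standing in for the real fingerprint index and container writer. It is an illustration of the history-aware idea, not the dissertation's implementation.

```python
# Sketch of the History-Aware Rewriting (HAR) idea: containers found to be
# sparse by the previous backup are recorded, and duplicate chunks that refer
# to them are rewritten in the next backup. Thresholds, sizes, and the
# `writer` object are illustrative assumptions, not values from the thesis.

from collections import defaultdict

CONTAINER_SIZE = 4 * 1024 * 1024   # assumed container capacity (4 MiB)
SPARSE_THRESHOLD = 0.5             # assumed utilization threshold

def identify_sparse_containers(backup_chunks, container_of, size_of):
    """After a backup, mark containers whose referenced data is only a small
    fraction of their capacity as sparse (candidates for rewriting next time)."""
    referenced = defaultdict(int)
    for fp in set(backup_chunks):                  # unique chunk fingerprints
        referenced[container_of[fp]] += size_of[fp]
    return {cid for cid, used in referenced.items()
            if used / CONTAINER_SIZE < SPARSE_THRESHOLD}

def har_backup(stream, fingerprint_index, sparse_from_last_backup, writer):
    """Deduplicate a backup stream, rewriting duplicates that live in
    containers recorded as sparse by the previous backup."""
    for fp, data in stream:                        # (fingerprint, chunk payload)
        container_id = fingerprint_index.get(fp)
        if container_id is None or container_id in sparse_from_last_backup:
            new_cid = writer.append(fp, data)      # write (or rewrite) the chunk
            fingerprint_index[fp] = new_cid        # point the index at the new copy
        # else: duplicate stored in a well-utilized container -> reference as-is
```

The history-aware part is that sparsity is measured after backup n and acted on during backup n+1, exploiting the observation that consecutive backup streams are similar.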
Reliability is an important issue in storage systems. Although data deduplication has been widely deployed, its impact on reliability remains unclear. By eliminating duplicate chunks, data deduplication reduces the number of storage devices (such as disks) required, and thus reduces the probability of disk errors; on the other hand, it increases the consequence of each disk error, because a single corrupted chunk may lead to multiple corrupted files. A quantitative methodology is therefore required to evaluate the reliability of deduplication systems. This dissertation extends the NOMDL metric by defining the logical magnitude of data loss to support deduplication systems, and builds a reliability simulator adapted for deduplication systems (SIMD). SIMD leverages field data from industry to simulate latent sector errors and whole-disk failures, and generates various data loss events in RAIDs. To calculate the logical magnitude of data loss, SIMD derives chunk and file models from realistic file system snapshots. Analysis and simulations with 18 realistic file system snapshots show that (1) due to intra-file redundancy, data deduplication reduces the number of files corrupted by sector errors, and (2) due to chunk fragmentation, data deduplication increases the amount of data in files corrupted by whole-disk failures. To improve reliability, the Deliberate Copy Technique (DCT) is proposed. DCT reserves 1% of the physical capacity of a RAID for copies of the most highly referenced chunks, and always repairs this dedicated physical area first during RAID reconstruction. At this moderate storage cost, DCT reduces the chunks and files corrupted by whole-disk failures by 31.8% and 25.8%, respectively.

Beyond the fragmentation and reliability issues, other components (such as the fingerprint index) are also important for building efficient deduplication storage systems. To systematically understand and compare existing and potential solutions, this dissertation proposes a general-purpose deduplication prototype named Destor. A deduplication system is summarized as an N-dimensional parameter space, where each dimension is a submodule or parameter with many available design choices; existing and potential solutions are points in this parameter space. Destor implements the entire parameter space and includes various mainstream deduplication solutions. With Destor, researchers can experimentally compare existing solutions and explore the parameter space to design new ones. To design more efficient deduplication systems, an extensive experimental exploration with three long-term backup datasets is conducted. The performance metrics of concern are memory overhead, storage cost, backup performance, and restore performance; the aim is to find solutions with sustained high backup performance and a reasonable tradeoff among the three remaining metrics. The experiments contribute 17 findings and conclude with four recommended solutions that meet these requirements: for the lowest storage cost, exact deduplication exploiting logical locality is recommended; for the lowest memory overhead, near-exact deduplication exploiting either logical or physical locality is recommended; for high sustained restore performance, exact deduplication exploiting physical locality with an efficient rewriting algorithm (such as HAR) is recommended; and for higher reliability, DCT is a promising addition at a slight storage cost without compromising backup and restore performance.
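To make the parameter-space view concrete, the sketch below enumerates a toy version of such a space. The dimensions and values are only those mentioned in this abstract (deduplication exactness, locality, rewriting algorithm, restore cache); the dictionary layout and function names are illustrative assumptions, not Destor's actual configuration interface.

```python
# Sketch of the N-dimensional parameter-space view behind Destor: each
# dimension is a submodule choice, and a concrete deduplication design is one
# point in the Cartesian product of all dimensions. Dimension names/values
# here are a simplified, assumed subset.

from itertools import product

PARAMETER_SPACE = {
    "deduplication": ["exact", "near-exact"],
    "locality":      ["logical", "physical"],
    "rewriting":     ["none", "HAR", "hybrid"],
    "restore_cache": ["LRU", "optimal"],
}

def enumerate_designs(space):
    """Yield every design point (one chosen value per dimension)."""
    dims = list(space)
    for values in product(*(space[d] for d in dims)):
        yield dict(zip(dims, values))

# Example: the recommendation "exact deduplication exploiting physical
# locality with HAR" is simply one point in this space.
for design in enumerate_designs(PARAMETER_SPACE):
    if (design["deduplication"], design["locality"], design["rewriting"]) == \
            ("exact", "physical", "HAR"):
        print(design)
```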
Keywords/Search Tags: Data Deduplication, Chunk Fragmentation, Rewriting Algorithm, Restore Algorithm, Garbage Collection, Fingerprint Index, Reliability