
Research On High I/O Performance Data Deduplication In Primary Storage System

Posted on: 2015-08-14    Degree: Doctor    Type: Dissertation
Country: China    Candidate: B Lin    Full Text: PDF
GTID: 1108330479479533    Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of enterprise data, storage capacity requirements have grown from tens of gigabytes to hundreds of terabytes and even petabytes. Faced with ever-growing storage needs, corporations must continuously purchase large numbers of storage devices to maintain their data services, which places great pressure on infrastructure and storage management. As research on storage systems has deepened, data deduplication, a storage-optimization technology that reduces the data footprint by detecting and eliminating redundant copies, has been widely adopted in industries such as financial services, pharmaceuticals, and telecommunications. Data deduplication can efficiently optimize storage space usage, improve network bandwidth utilization, and reduce operating costs, and it has rapidly become a major research focus in recent years.

Existing work on data deduplication mostly targets backup and archival systems, aiming to save storage space and optimize I/O performance. With the rise of social networks and cloud storage services, however, many primary storage workloads (e.g., email, user directories, databases) exhibit high data redundancy and can greatly benefit from data deduplication. Unlike backup and archival systems, primary storage systems hold large volumes of data, exhibit complicated access patterns, and demand high I/O performance, which makes deploying data deduplication challenging because of the associated latency costs. Most existing deduplication methods suffer from the following problems.

First, existing algorithms overlook the characteristics of duplicated writes in primary storage. Most work focuses on backup and archival storage, where the data access pattern is "write once, read later." In primary storage, I/O writes are far more complicated, so previous algorithms can hardly be applied to perform data deduplication efficiently.

Second, existing methods focus only on designing algorithms with low computational complexity while ignoring the parallelism available in duplicate detection, and they take no advantage of heterogeneous parallel computing frameworks such as CPU+GPU for acceleration. In fact, duplicate detection can be performed in parallel to overlap the running time, so the efficiency of this I/O-intensive task can be dramatically improved.

Third, previous studies view deduplication only from the underlying-storage perspective, in which deduplication introduces disk fragmentation and impairs read performance. In fact, deduplication also offers unique opportunities to optimize read performance through additional cache hits; in certain circumstances it can not only save storage but also reduce read latency. The design of a deduplication system therefore needs to consider the effects of all layers in the storage hierarchy.

Fourth, previous work overlooks the dynamic characteristics exhibited by accesses to duplicated data. To reduce the impact of fragmentation, traditional defragmentation has been widely used in many operating systems. Unfortunately, file defragmentation ignores the dynamic access patterns of duplicated data and is highly constrained by block sharing in a deduplication system, which makes it hard to optimize read performance.

To address the above challenges, this thesis systematically investigates several key issues of I/O-performance-oriented data deduplication in primary storage systems.
The research mainly focuses on the following aspects.

To reduce the write cost of deduplication, this thesis investigates the characteristics of real-world I/O workloads in depth and finds that the access patterns of duplicated data are highly skewed. Motivated by this, we propose Leach, which adopts a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns, and keeps the hot working set in cache memory, with the goal of serving the majority of duplicate-detection lookups from cache. Leveraging the working-set property, Leach further reduces the cost of splay operations on the fingerprint index and of cache updates.

For parallel duplicate detection, we present G-Paradex, a novel parallel deduplication framework. G-Paradex organizes chunk fingerprints in a prefix tree and uses the GPU to search this tree in parallel. Leveraging the inherent chunk locality of the written data stream, G-Paradex groups consecutive chunks and extracts their handprints into the prefix tree, shrinking the index size and reducing on-disk accesses. Exploiting the characteristics of the GPU architecture, G-Paradex speculatively prunes the index tree transferred to GPU memory for processing, which alleviates the device (GPU) memory bottleneck and accelerates duplicate detection.

To deal with the disk fragmentation introduced by data deduplication, we investigate the impact of caching on read performance after deduplication and observe from real-world traces that deduplication does not always impair reads; it can also improve read performance through additional cache hits. Motivated by this, we propose CareDedup, a novel cache-aware deduplication scheme that exploits this opportunity. Based on a uniform locality-assessment algorithm, CareDedup selects the most profitable duplicated blocks to deduplicate so as to maximize read performance.

To reduce the impact of fragmentation, traditional defragmentation is highly constrained by block sharing in a deduplication system, which makes it impossible to optimize the on-disk layout. In this thesis, we propose ReDedup, motivated by an observation from real-world I/O workloads: the access frequency of duplicated data is non-uniformly distributed. Leveraging this skew, ReDedup dynamically estimates the access randomness and block-sharing relationships of individual files from their I/O history, and makes the majority of I/O requests more sequential by eliminating the fragments of hot files and shifting them to rarely read files.

To summarize, this thesis presents solutions to several essential issues of data deduplication. Comprehensive experiments demonstrate that the proposed algorithms achieve their design goals. Illustrative sketches of the key mechanisms behind Leach, G-Paradex, CareDedup, and ReDedup are given below.
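The abstract describes Leach only at a high level. The following is a minimal sketch, assuming a simplified splay tree keyed by chunk fingerprints, of how a self-adjusting index keeps hot fingerprints near the root so that most duplicate lookups hit the in-memory working set; all names and structures here are illustrative assumptions, not the thesis's actual implementation.

```python
import hashlib

class Node:
    __slots__ = ("key", "val", "left", "right")
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.left = self.right = None

def _rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x
    return y

def _rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y

def _splay(root, key):
    """Bring the node with `key` (or the last node on the search path) to the root."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                      # zig-zig
            root.left.left = _splay(root.left.left, key)
            root = _rotate_right(root)
        elif key > root.left.key:                    # zig-zag
            root.left.right = _splay(root.left.right, key)
            if root.left.right is not None:
                root.left = _rotate_left(root.left)
        return root if root.left is None else _rotate_right(root)
    else:
        if root.right is None:
            return root
        if key > root.right.key:                     # zig-zig
            root.right.right = _splay(root.right.right, key)
            root = _rotate_left(root)
        elif key < root.right.key:                   # zig-zag
            root.right.left = _splay(root.right.left, key)
            if root.right.left is not None:
                root.right = _rotate_right(root.right)
        return root if root.right is None else _rotate_left(root)

class FingerprintIndex:
    """Splay-tree fingerprint index: frequently looked-up (hot) fingerprints
    migrate toward the root, so the upper part of the tree plays the role of
    the in-memory working-set cache over the on-disk index."""
    def __init__(self):
        self.root = None

    def lookup(self, fp):
        self.root = _splay(self.root, fp)
        return self.root.val if self.root and self.root.key == fp else None

    def insert(self, fp, block_addr):
        if self.root is None:
            self.root = Node(fp, block_addr)
            return
        self.root = _splay(self.root, fp)
        if self.root.key == fp:
            self.root.val = block_addr        # duplicate chunk already indexed
            return
        node = Node(fp, block_addr)
        if fp < self.root.key:
            node.left, node.right = self.root.left, self.root
            self.root.left = None
        else:
            node.right, node.left = self.root.right, self.root
            self.root.right = None
        self.root = node

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()
```

Because every lookup splays the accessed fingerprint to the root, skewed workloads naturally keep their hot fingerprints close to the top of the tree, which is the property the abstract attributes to Leach.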
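G-Paradex is described as grouping consecutive chunks and extracting "handprints" (a small sample of each group's fingerprints) into a prefix tree so that the index stays small enough for parallel GPU-side search. The sketch below shows only the CPU-side grouping and handprint sampling over a toy prefix index; the GPU traversal and speculative pruning are omitted, and the group size, prefix length, and sampling rule (the k smallest fingerprints per group) are assumptions.

```python
import hashlib
from collections import defaultdict

GROUP_SIZE = 64      # consecutive chunks per group (assumed)
HANDPRINT_K = 4      # fingerprints sampled per group as its "handprint" (assumed)
PREFIX_LEN = 2       # leading bytes of a fingerprint used as the prefix key (assumed)

def fingerprint(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()

def handprint(group_fps):
    """Sample the k smallest fingerprints of a group as its representative handprint."""
    return sorted(group_fps)[:HANDPRINT_K]

class PrefixIndex:
    """Toy prefix index: maps a short fingerprint prefix to the groups whose
    handprints contain a fingerprint with that prefix."""
    def __init__(self):
        self.buckets = defaultdict(list)   # prefix -> [(group_id, fingerprint)]

    def add_group(self, group_id, group_fps):
        for fp in handprint(group_fps):
            self.buckets[fp[:PREFIX_LEN]].append((group_id, fp))

    def candidate_groups(self, fp):
        # Groups sharing a handprint fingerprint with an incoming chunk are
        # likely, by chunk locality, to contain many of its neighbours as well,
        # so only their full fingerprint lists need to be fetched from disk.
        return {gid for gid, hfp in self.buckets.get(fp[:PREFIX_LEN], []) if hfp == fp}

def index_stream(chunks):
    """Group a write stream into runs of consecutive chunks and index their handprints."""
    index = PrefixIndex()
    for offset in range(0, len(chunks), GROUP_SIZE):
        group_fps = [fingerprint(c) for c in chunks[offset:offset + GROUP_SIZE]]
        index.add_group(offset // GROUP_SIZE, group_fps)
    return index
```

Keeping only a few representative fingerprints per group is what shrinks the index; in G-Paradex this compact structure is what gets shipped to the GPU for parallel lookup.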
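CareDedup is said to choose which duplicate blocks to actually deduplicate based on a locality assessment, trading disk fragmentation against extra cache hits. The scoring rule, window, and threshold below are illustrative assumptions only: fingerprints whose reads cluster closely in time (high cache-hit benefit) are deduplicated, while scattered duplicates are left in place to preserve on-disk locality.

```python
from collections import defaultdict

def dedup_decisions(read_trace, window=1000, threshold=0.5):
    """Decide, per duplicate fingerprint, whether deduplicating it is profitable.

    read_trace: list of (timestep, fingerprint) read events from a workload trace.
    A fingerprint whose reads fall within `window` timesteps of each other would
    mostly be served from cache after deduplication (benefit), while widely
    scattered reads would instead pay for the extra seeks that sharing introduces.
    """
    accesses = defaultdict(list)
    for t, fp in read_trace:
        accesses[fp].append(t)

    decisions = {}
    for fp, times in accesses.items():
        times.sort()
        # Fraction of re-reads landing inside the (assumed) cache reuse window.
        close = sum(1 for a, b in zip(times, times[1:]) if b - a <= window)
        reuse = close / max(len(times) - 1, 1)
        decisions[fp] = reuse >= threshold      # True -> deduplicate this block
    return decisions
```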
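ReDedup is described as relocating shared blocks so that frequently read files become sequential while rarely read files absorb the fragmentation. Below is a heavily simplified sketch assuming only per-file read counters; the thesis's actual estimation of access randomness and block-sharing relationships is not reconstructed here.

```python
def assign_shared_block(block, files, read_count):
    """Decide which file keeps a shared block inside its sequential extent.

    files: names of the files that share `block` after deduplication.
    read_count: file name -> observed number of reads (assumed statistic).
    The hottest file keeps the block physically inline with its other blocks;
    the colder files reference it remotely and absorb the fragmentation.
    """
    owner = max(files, key=lambda f: read_count.get(f, 0))
    return {"block": block, "inline_in": owner,
            "remote_refs": [f for f in files if f != owner]}

# Example: two files share block "B7"; the frequently read log file keeps it inline.
layout = assign_shared_block("B7", ["hot.log", "cold.bak"],
                             {"hot.log": 900, "cold.bak": 3})
```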
Keywords/Search Tags: Primary Storage, Data Deduplication, Duplicate Detection, Disk Fragment, Fingerprint Cache, Parallel Processing, Cache-aware Deduplication, Data Reallocation