
Research On High Performance Redundancy Elimination Techniques For Data Backup Systems

Posted on: 2015-08-03    Degree: Doctor    Type: Dissertation
Country: China    Candidate: W Xia    Full Text: PDF
GTID: 1228330428465752    Subject: Computer system architecture
Abstract/Summary:
As the amount of digital data has grown explosively in recent years, data deduplication and delta compression have gained increasing attention as two key space-efficient technologies in backup storage systems. Compared with traditional compression approaches, deduplication and delta compression scale well in mass storage systems, which helps reduce storage space requirements and improve the availability of network bandwidth. With the growth of unstructured data in storage systems, challenges for deduplication and delta compression remain: the disk-indexing bottleneck for duplicate and resemblance detection over ever-growing volumes of digital data, and the computation overhead of deduplication and delta compression in the face of steadily increasing storage and network bandwidth and speed. Thus, the central question in current data reduction research is how to effectively and efficiently compress the data in mass storage systems at low overhead.

As the volume of deduplicated data grows from TB-scale to PB-scale, the fingerprints of data chunks have to be stored and managed on disk, which incurs unacceptable throughput due to frequent accesses to on-disk indices. To address this problem, SiLo, a near-exact deduplication system, was presented, which effectively and complementarily exploits similarity and locality to achieve high-performance data deduplication. The main idea behind SiLo is to expose and exploit more similarity by dividing the backup stream into segments, extracting their similarity features, and indexing these features in RAM, and to leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. Experimental evaluation based on real-world datasets shows that, by judiciously enhancing similarity through the exploitation of locality and vice versa, SiLo consumes only about 1/25 and 1/10 of the RAM capacity required for deduplication indexing by the state-of-the-art ChunkStash and Extreme Binning approaches, respectively, while achieving a duplicate detection efficiency of more than 99% and maintaining a much higher deduplication indexing throughput.

Most existing state-of-the-art deduplication methods remove redundant data at the chunk level, which incurs unavoidable and significant time overhead due to Rabin-based chunking and SHA-1-based fingerprinting. To address this problem, P-Dedupe, a pipelined and parallel deduplication system, was presented to reduce the computation overhead by exploiting parallelism in the deduplication system. The main idea behind P-Dedupe is to pipeline and parallelize the computational stages of data deduplication (i.e., chunking, fingerprinting, indexing, and writing) by effectively exploiting the idle resources of modern computer systems with multi-core and many-core processor architectures. In addition, its general philosophy of pipelining deduplication and parallelizing hashing is well positioned to embrace the industry trend of building multi-core and many-core processors. Experimental evaluation based on several workloads shows that P-Dedupe speeds up deduplication write throughput by a factor of 2~4 by efficiently exploiting parallelism for data deduplication.
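To make the similarity-and-locality idea more concrete, the following sketch (written in Python for this summary, not taken from the dissertation) shows one simplified way such a RAM index could be built: chunk fingerprints are grouped into segments, and only one representative feature per segment (here, the minimum fingerprint, an assumed choice) is kept in memory, while contiguous segments are grouped into blocks to preserve locality. The constants and helper names are illustrative placeholders, not SiLo's actual parameters.

    # Illustrative sketch only: a simplified similarity-and-locality index in the
    # spirit of SiLo. The segment/block sizes and the use of the minimum
    # fingerprint as the similarity feature are assumptions for this example.
    import hashlib

    CHUNKS_PER_SEGMENT = 128   # assumed segment size, in chunks
    SEGMENTS_PER_BLOCK = 4     # assumed block size, in segments

    def fingerprint(chunk: bytes) -> bytes:
        # SHA-1 fingerprinting, as commonly used in chunk-level deduplication.
        return hashlib.sha1(chunk).digest()

    def build_ram_index(chunks):
        """Map each segment's similarity feature to its block id.

        Only one feature per segment is held in RAM; the full fingerprint
        lists of a block would be fetched from disk when a feature matches.
        """
        fps = [fingerprint(c) for c in chunks]
        index = {}
        for seg_start in range(0, len(fps), CHUNKS_PER_SEGMENT):
            segment = fps[seg_start:seg_start + CHUNKS_PER_SEGMENT]
            feature = min(segment)                            # representative feature
            block_id = (seg_start // CHUNKS_PER_SEGMENT) // SEGMENTS_PER_BLOCK
            index[feature] = block_id
        return index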
One of the key problems in data deduplication systems is how to maximize redundancy detection at low overhead. To address this problem, DARE, a Deduplication-Aware Resemblance detection and Elimination scheme for compressing backup datasets, was presented, which effectively combines data deduplication and delta compression to achieve high data reduction efficiency at low overhead. The main idea behind DARE is to employ a scheme called Duplicate-Adjacency-based Resemblance Detection (DupAdj), which considers any two data chunks to be similar (i.e., candidates for delta compression) if their respective adjacent data chunks are found to be duplicates in a deduplication system, and then to further improve resemblance detection efficiency with an enhanced super-feature approach. Experimental results based on several datasets show that DARE consumes only about 1/4 and 1/2 of the computation and indexing overheads, respectively, required by the traditional super-feature approaches, while detecting 2-10% more redundancy and achieving a much higher system throughput for data reduction.

Delta compression is an efficient data reduction approach for removing redundancy among similar data chunks and files in storage systems. One of the main challenges facing delta compression is its low encoding speed. To address this problem, Ddelta, a deduplication-inspired fast delta compression scheme, was presented, which leverages the simplicity and efficiency of data deduplication techniques to improve delta encoding/decoding performance. The basic idea behind Ddelta is to accelerate the delta encoding and decoding processes through a novel combination of Gear-based chunking and Spooky-based fingerprinting for fast identification of duplicate strings for delta calculation, and to exploit the content locality of redundant data to detect more duplicates by greedily scanning the areas immediately adjacent to already-detected duplicate chunks/strings. Experimental evaluation based on real-world datasets shows that Ddelta achieves an encoding speedup of 2.5X~8X and a decoding speedup of 2X~20X over the classic delta compression approaches Xdelta and Zdelta, while achieving a comparable compression ratio.
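As a concrete illustration of the chunking side of this approach, the sketch below (Python, written for this summary rather than taken from the dissertation) implements a Gear-style rolling hash that declares a chunk/string boundary whenever the low bits of the hash are all zero; the table seeding and the boundary mask are assumptions for this example, not Ddelta's exact parameters.

    # Illustrative sketch only: Gear-based content-defined chunking in the spirit
    # of Ddelta. The random table seeding and the boundary mask (targeting an
    # average chunk size of roughly 8 KB) are assumptions for this example.
    import random

    random.seed(0)
    GEAR_TABLE = [random.getrandbits(32) for _ in range(256)]
    MASK = 0x1FFF   # 13 low bits -> ~8 KB expected chunk size

    def gear_chunks(data: bytes):
        """Yield (start, end) offsets of content-defined chunks in data."""
        h = 0
        start = 0
        for i, b in enumerate(data):
            # One table lookup, one shift, and one add per byte keep the
            # rolling hash much cheaper than Rabin fingerprinting.
            h = ((h << 1) + GEAR_TABLE[b]) & 0xFFFFFFFF
            if (h & MASK) == 0:
                yield (start, i + 1)
                start = i + 1
                h = 0
        if start < len(data):
            yield (start, len(data))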
Keywords/Search Tags: Redundancy Elimination, Data Deduplication, Delta Compression, Data Backup System, Redundant Data Locality