
Research on High-Performance Fine-Grained Deduplication for Backup Storage Systems

Posted on: 2024-01-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Zou
Full Text: PDF
GTID: 1528307376486054
Subject: Computer Science and Technology
Abstract/Summary:
In the current digital age, backup storage systems face huge challenges. On the one hand, the widespread use of data-driven applications drives ever-growing demand for storage capacity; on the other hand, the proliferation of cyber attacks and ransomware increases the need for data backup. Effectively compressing backup data has therefore become a critical issue for backup storage systems. Fine-grained deduplication is a system-level compression scheme for ultra-large-scale data that effectively reduces the size of backup data. Specifically, it achieves global redundancy elimination: it removes identical chunks and also the identical parts shared between similar chunks. It can thus significantly increase data storage density and reduce backup storage costs. However, the compression and decompression speed of fine-grained deduplication is quite slow and cannot meet the demands of periodically backing up large-scale data. The slowness has two main causes: (1) the extra computation and I/O overhead introduced by delta encoding seriously drags down fine-grained deduplication's processing speed, which falls far short of the actual performance requirements of backup storage; (2) it is difficult to know whether deploying fine-grained deduplication is worthwhile for the backup data generated in a specific scenario, and how to configure suitable parameters. This dissertation proposes several techniques to address these performance issues, summarized as follows.

During backup ingestion and deduplication, resemblance detection incurs a huge computational overhead. After analyzing the main computational costs of widely used approaches, this dissertation proposes a new approach, Odess, which reduces the overall computational complexity by introducing a high-performance rolling hash algorithm and a content-based sampling method. Evaluations on real-world and synthetic datasets show that Odess achieves better processing speed while maintaining very close detection accuracy.

When storing and managing deduplicated data, the data suffer from fragmentation caused by chunk sharing between backups. This dissertation proposes MFDedup, a lifecycle-based data-layout management mechanism. MFDedup examines the correlation between data chunks from a lifecycle perspective (i.e., which backups refer to a given chunk) and plans the relative locations of data chunks on the storage media: chunks with the same lifecycle are grouped into the same category and stored sequentially, while related categories are stored in order. This mechanism ensures that adjacent data chunks within the same category have strong locality and are either all needed or all unneeded when restoring a given backup. It thus solves the first type of fragmentation, caused by data sharing (deduplication) between different backups. Experiments on real-world datasets verify the effectiveness of the proposed data layout and the feasibility of its evolution algorithm.

During data recovery and restore, the data suffer from fragmentation caused by delta encoding between similar chunks. This dissertation proposes MeGA, a locality-mining framework, to address this problem. MeGA exploits the locality among reference chunks in delta encoding to reduce the average cost of reading reference chunks from disk. It further exploits the implied temporal correlation between reference chunks and delta chunks, as well as the different I/O characteristics of different storage media, to find a special data-access path during recovery that shifts random I/O operations from the backup storage media (e.g., HDDs) to the user's storage media (e.g., SSDs). These two designs reduce the I/O overhead during data backup and data recovery, respectively. Evaluations on real-world backup datasets show that MeGA effectively reduces the I/O overhead.

Once a backup system is deployed, fine-grained deduplication suffers from the difficulty of parameter selection. This dissertation proposes DECT, a prediction framework for the compression ratio of fine-grained deduplication, to address this issue. DECT quickly predicts the compression ratio for a specific fine-grained deduplication configuration, providing a reference for system design and parameter selection. On the one hand, DECT predicts delta-encoding gains based on chunks' similarity fingerprints; on the other hand, it applies learning-based parameter inference that avoids the sampling skew implied by similarity-fingerprint generation, thus achieving accurate and fast prediction. Evaluations on real-world datasets show that DECT offers high accuracy, high speed, and sensitivity to different parameter configurations.
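To make the resemblance-detection idea above concrete, here is a minimal sketch of the general style Odess is described as using: a Gear-like rolling hash scanned over a chunk, content-based sampling of hash values, and linear transforms whose per-transform maxima form features that are grouped into super-features. The Gear table, sampling mask, transform coefficients, and group size below are illustrative placeholders chosen for this sketch, not the dissertation's actual constants.

```python
import hashlib

# Illustrative 256-entry Gear table, derived deterministically here;
# real implementations typically use a randomly generated table.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big")
        for b in range(256)]

MASK = 0x0000_4000_3000_0000  # illustrative sampling mask (~1/8 density)
MOD = 1 << 64

def features(chunk: bytes, n_transforms: int = 12) -> list[int]:
    """Sample rolling-hash values by content, apply linear transforms,
    and keep the per-transform maximum as the chunk's features."""
    # Illustrative coefficients; a real system would draw these randomly.
    ms = [2 * i + 3 for i in range(n_transforms)]
    bs = [7 * i + 1 for i in range(n_transforms)]
    feats = [0] * n_transforms
    h = 0
    for byte in chunk:
        h = ((h << 1) + GEAR[byte]) % MOD  # Gear-style rolling hash
        if h & MASK == 0:                  # content-based sampling
            for i in range(n_transforms):
                v = (ms[i] * h + bs[i]) % MOD
                if v > feats[i]:
                    feats[i] = v
    return feats

def super_features(feats: list[int], group: int = 4) -> list[int]:
    """Group features into super-features; chunks sharing any
    super-feature are treated as resemblance candidates."""
    return [hash(tuple(feats[i:i + group]))
            for i in range(0, len(feats), group)]
```

Because samples are taken only where masked hash bits are zero, two similar chunks tend to sample the same positions in their shared regions and thus share features, which is what makes super-feature matching a cheap resemblance test.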
Keywords/Search Tags:Deduplication, Backup Storage, Resemblance Detection, Maintaining Locality, Performance Optimization