
Study On Data Deduplication Technique For Data Backup Systems

Posted on: 2013-01-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Tan
Full Text: PDF
GTID: 1118330371480981
Subject: Computer system architecture
Abstract/Summary:
With the explosive growth of data, data deduplication has become a common compression component in large-scale data backup systems, because it is lossless and achieves high compression ratios. However, data deduplication still faces problems and challenges that vary with the backup datasets. For example, the source deduplication used for cloud backup services cannot greatly reduce backup times, and existing deduplication methods create many data fragments that degrade deduplication performance.

Because the WAN (Wide Area Network) links that support cloud backup services offer low bandwidth, backup times urgently need to be reduced. Existing source deduplication methods, namely source global chunk-level deduplication and source local chunk-level deduplication, remove redundant chunks before sending data to the remote backup destination. The former removes duplicate data globally across different clients but requires long deduplication time, while the latter removes duplicate data only locally within the same client to shorten deduplication time, but achieves a low deduplication elimination ratio and requires long data transmission time. Neither method can greatly reduce the backup time. In this dissertation, we propose a semantic-aware multi-tiered deduplication framework (SAM) for cloud backup services. SAM combines source global file-level deduplication with source local chunk-level deduplication and exploits file semantics to narrow the search space for duplicate data and reduce deduplication overhead. Compared with existing source deduplication methods, SAM achieves a higher deduplication elimination ratio than source local chunk-level deduplication and shorter deduplication time than source global chunk-level deduplication, achieving an optimal tradeoff between elimination ratio and deduplication overhead and greatly shortening the backup window.

However, existing source deduplication methods, including SAM, focus only on removing redundant data for cloud backup operations and pay little attention to restore time. According to our survey, restore time is critical to enterprises that demand high data reliability, because data disasters incur large financial costs. In this dissertation, we propose a causality-based deduplication performance booster (CABDedupe) for cloud backup services, which captures and preserves the causal relationships among chronological versions of the backup datasets. Using this causality information, CABDedupe removes redundant data not only for cloud backups, to reduce backup time, but also for cloud restores, to reduce restore time. Moreover, CABDedupe is a middleware module that is orthogonal to, and can be integrated into, any existing backup system; if it fails, some redundant data is merely retained and transmitted during backups and restores, but the backups and restores themselves are not disturbed and do not fail.
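As an illustration of the multi-tiered idea behind SAM described above, the following is a minimal sketch, assuming a whole-file fingerprint is first checked against a global (cross-client) file index and only non-duplicate files are then chunked and checked against a local per-client chunk index. The names (global_file_index, local_chunk_index), the fixed-size chunking, and the omission of SAM's file-semantics hints are all simplifications for illustration, not the dissertation's actual design.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # fixed-size chunking for simplicity; real systems often
                       # use content-defined chunking

def fingerprint(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def backup_file(data: bytes, global_file_index: set, local_chunk_index: set):
    """Return the chunks that must be sent to the remote backup destination."""
    file_fp = fingerprint(data)

    # Tier 1: global file-level deduplication across all clients.
    if file_fp in global_file_index:
        return []                      # whole file is a duplicate, send nothing
    global_file_index.add(file_fp)

    # Tier 2: local chunk-level deduplication within this client only.
    chunks_to_send = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        chunk_fp = fingerprint(chunk)
        if chunk_fp not in local_chunk_index:
            local_chunk_index.add(chunk_fp)
            chunks_to_send.append((chunk_fp, chunk))
    return chunks_to_send
```

Restricting the expensive global lookup to whole-file fingerprints while keeping the chunk index per client is what narrows the duplicate-search space in this sketch.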
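The causality idea behind CABDedupe can likewise be sketched, assuming the middleware records, per backup version, a mapping from file path to content fingerprint; only files that changed since the previous version are sent at backup time, and only files that differ from what survives on the client are fetched at restore time. The class and method names below are hypothetical.

```python
class CausalityIndex:
    """Tracks which files changed across chronological backup versions."""

    def __init__(self):
        self.versions = {}                      # version_id -> {path: fingerprint}

    def record_backup(self, version_id, snapshot):
        """snapshot: {path: fingerprint} for the files in this backup version."""
        self.versions[version_id] = dict(snapshot)

    def files_to_send_for_backup(self, version_id, prev_version_id):
        """Only files that changed since the previous version need transferring."""
        prev = self.versions.get(prev_version_id, {})
        cur = self.versions[version_id]
        return [p for p, fp in cur.items() if prev.get(p) != fp]

    def files_to_fetch_for_restore(self, version_id, client_state):
        """client_state: {path: fingerprint} of what still survives on the client."""
        target = self.versions[version_id]
        return [p for p, fp in target.items() if client_state.get(p) != fp]
```

Because such an index only supplements the backup system, losing it merely means more data is transferred than necessary, mirroring the failure behaviour described above.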
Because redundant data is removed, any deduplication approach forces the same file or data stream to be divided into multiple pieces, producing many data fragments. Data fragmentation becomes much more severe for long-term backups and retention, significantly affecting deduplication performance, including deduplication throughput, data read performance, and the data reliability associated with the deduplication process. In this dissertation, we analyze the negative effects of data fragmentation on deduplication performance and propose a simple but effective approach, called De-Frag, to alleviate it. The key idea of De-Frag is to leave some redundant data unremoved, thereby reducing data fragments and preserving data locality, and to use a threshold to restrict the amount of redundant data left unremoved. Extensive experiments driven by real-world datasets show that, built on top of existing deduplication approaches, De-Frag effectively improves deduplication performance while sacrificing little of the deduplication elimination ratio.
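The threshold idea behind De-Frag can be sketched as follows, under the assumption that stored chunks live in numbered containers and that a duplicate whose existing copy sits in a distant (old) container is stored again rather than referenced, as long as the total rewritten bytes stay below a threshold. The container model, LOCALITY_WINDOW, and REWRITE_THRESHOLD are illustrative assumptions, not the dissertation's actual parameters.

```python
LOCALITY_WINDOW = 50         # containers considered "close" to the current one
REWRITE_THRESHOLD = 0.05     # rewrite at most 5% of the incoming bytes

def write_stream(chunks, chunk_index, current_container_id, store_chunk):
    """chunks: iterable of (fingerprint, data).
    chunk_index: fingerprint -> container id of the stored copy.
    store_chunk(fp, data): stores a chunk and returns its container id.
    (Container filling and rollover are omitted from this sketch.)"""
    total_bytes = 0
    rewritten_bytes = 0
    for fp, data in chunks:
        total_bytes += len(data)
        container = chunk_index.get(fp)
        if container is None:
            chunk_index[fp] = store_chunk(fp, data)            # new chunk
        elif (current_container_id - container > LOCALITY_WINDOW
              and rewritten_bytes + len(data) <= REWRITE_THRESHOLD * total_bytes):
            # Duplicate, but referencing the old copy would fragment the stream:
            # store it again (leave the redundancy unremoved) to preserve locality.
            chunk_index[fp] = store_chunk(fp, data)
            rewritten_bytes += len(data)
        # else: ordinary duplicate, just reference the existing copy
```

Because the rewritten duplicates are bounded by the threshold, the loss in deduplication elimination ratio stays small while reads of recent backups touch far fewer containers.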
Keywords/Search Tags: Data Backup System, Data Deduplication, Source Deduplication, Backup Window, Restore Time, Data Fragmentation