Research On Performance Optimization Based On Container Characteristics In Deduplication-based Backup Systems

Posted on: 1022-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: D T Zhang
GTID: 2428330647960084
Subject: Computer Science and Technology
Abstract/Summary:
The rapid growth of data volume poses severe challenges to limited storage space. Data deduplication identifies and removes duplicate data chunks, greatly reducing both the space required to store data and the bandwidth required to transfer it, so it is widely used in backup and archive systems. However, disk-based index lookups and data fragmentation in deduplication-based backup systems impair the performance of data backup and recovery, respectively. As the basic unit for retaining the locality of backup streams, the container is closely tied to both the backup and the recovery process. Based on several characteristics of the container, this paper proposes two methods to optimize the performance of deduplication-based backup systems. The main work and innovations are as follows:

Improving the performance of deduplication-based backup systems via hot indexes distilling based on container utilization. Through research and experiments, we find that: (i) during the backup process, only a small fraction of the indexes are frequently accessed, while the large majority are rarely accessed; (ii) container utilization reflects how frequently the indexes are accessed. In this regard, we propose Hot Indexes Distilling (HID). HID removes cold indexes from the global index and retains only hot indexes in memory, which greatly improves the hit ratio of index lookups. In addition, HID slightly improves data recovery performance. HID introduces a new feature named SDTU, which refers to the phenomenon that a small number of duplicate chunks are transformed into unique chunks. SDTU compensates for the shortcoming that the Bloom filter cannot identify unique chunks. To fully leverage the features of SDTU and the Bloom filter for improving backup performance, we further propose an evolved version of HID, named EHID. EHID integrates a Bloom filter and maps only the hot indexes into it. EHID has two salient features: (i) it avoids the disk I/Os triggered by identifying unique chunks; (ii) it reduces the false positive rate of the Bloom filter. These two features allow EHID to work efficiently at all times.

An approximately optimal rewriting algorithm based on container reference rate. Traditional rewriting algorithms sort the containers within a single data segment in descending order of container reference rate and select the several containers with the lowest reference rate within that range. However, when the scope is expanded from a single data segment to multiple data segments or the entire backup stream, we find that these containers are not the optimal ones, i.e., not the containers with the lowest reference rate. In this regard, we propose an approximately optimal rewriting algorithm named OPT, which records the distribution of container reference rates across multiple data segments in a hash bucket array in order to select the approximately optimal containers. Moreover, OPT has two working modes: the optimal rewriting mode, designed to improve the deduplication ratio, and the aggressive rewriting mode, designed to improve recovery performance. OPT achieves a good tradeoff between deduplication ratio and recovery performance by adaptively switching between the two modes.
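
The container selection step described for OPT can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The bucket count, the rewrite `budget` parameter, the function names, and the rule of excluding a bucket that would exceed the budget are all assumptions made for illustration, and the two working modes are not modeled. The sketch only shows how a bucket array of reference rates collected over several segments can yield a single cutoff for the whole range.

```python
from typing import Dict, List, Tuple

NUM_BUCKETS = 100  # assumed granularity: one bucket per 1% of reference rate


def bucket_of(rate: float, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a container reference rate in [0, 1] to a bucket index."""
    return min(int(rate * num_buckets), num_buckets - 1)


def build_histogram(segments: List[Dict[str, float]],
                    num_buckets: int = NUM_BUCKETS) -> List[int]:
    """Count containers per reference-rate bucket across all segments."""
    counts = [0] * num_buckets
    for segment in segments:
        for rate in segment.values():
            counts[bucket_of(rate, num_buckets)] += 1
    return counts


def rewrite_threshold(counts: List[int], budget: int) -> int:
    """Return the first bucket index that must NOT be rewritten.

    Buckets are scanned from the lowest reference rate upward and are
    accepted while the running container count stays within `budget`
    (a hypothetical cap on the number of containers to rewrite).
    """
    total = 0
    for idx, n in enumerate(counts):
        if total + n > budget:
            return idx
        total += n
    return len(counts)


def select_for_rewrite(segments: List[Dict[str, float]], budget: int,
                       num_buckets: int = NUM_BUCKETS) -> List[Tuple[int, str]]:
    """Pick (segment index, container id) pairs below the global cutoff."""
    limit = rewrite_threshold(build_histogram(segments, num_buckets), budget)
    return sorted(
        (seg_idx, cid)
        for seg_idx, segment in enumerate(segments)
        for cid, rate in segment.items()
        if bucket_of(rate, num_buckets) < limit
    )


if __name__ == "__main__":
    # Two segments, each mapping a container id to its reference rate.
    segments = [{"c1": 0.05, "c2": 0.15}, {"c3": 0.55, "c4": 0.95}]
    print(select_for_rewrite(segments, budget=2, num_buckets=10))
```

With a budget of two containers, this example selects `c1` and `c2`, both from the first segment. Choosing the lowest-rate container separately within each segment would instead pick `c1` and `c3`, even though `c3` has a much higher reference rate than `c2`. The selection is only approximate because containers that share a bucket are accepted or rejected together.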
Keywords/Search Tags: Data deduplication, Backup system, Disk bottleneck, Fragmentation, Rewriting, Data recovery