Research On Data Garbage Collection Technique In Data Deduplication Based Backup Systems

Posted on: 2019-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: T Liu
Full Text: PDF
GTID: 2428330566977019
Subject: Engineering
Abstract/Summary:
In the field of information storage, garbage collection in backup systems that use data deduplication has long been a focus of attention. A backup system generally sets a retention time for backup data, and data that outlives it should be collected. After deduplication, however, only one copy of each repeated data block is retained, so a single data block is likely to be referenced multiple times within one backup stream version and may also be referenced by several backup stream versions. These multiple references to the same data block make it harder to reclaim expired data blocks. How to eliminate these invalid data blocks effectively and reuse the storage space they occupy is an urgent problem for backup storage systems that apply data deduplication.

Existing deduplication-based backup systems use two garbage collection methods: reference count (RC) and mark and sweep (MS). The main idea of reference count is to keep a reference count value for each data block in the deduplication system. Each reference to the block adds 1 to its count, and a block whose count is 0 is judged to be garbage. Mark and sweep differs in that it keeps no reference count value and performs no preprocessing on the data during the backup phase; instead, during the garbage collection phase it scans all backup metadata to find unreferenced garbage data blocks. The disadvantages of both methods are obvious. The main disadvantage of reference count is its low reliability: any repeated or deferred update of a reference count value leaves the value incorrect, so the count no longer matches the references actually stored in the system, which leads to garbage collection errors. The main drawback of mark and sweep is that scanning the backup data takes too long, so garbage data is marked too slowly.

To address these shortcomings, this thesis proposes a garbage collection mechanism based on a reference time map (Gc_RTM). The mechanism works in units of storage containers, building a reference time map (RTM) and a container bitmap (CBT) for each storage container. Combining the reference time map with the container bitmap quickly identifies the garbage chunks that can be reclaimed and the storage space that can be reused. Compared with reference count, this mechanism uses a reference time map and does not rely on simple add-1 and subtract-1 operations on a count value, which gives it higher reliability. Compared with mark and sweep, it can quickly mark the garbage data to be reclaimed by using the reference time map and container bitmap of each storage container, without a full scan of the backup data stream, so garbage is reclaimed faster.

The thesis evaluates Gc_RTM on a large number of test data sets. The results show that Gc_RTM outperforms both reference count and mark and sweep in the time overhead and the space overhead of garbage collection. In time, Gc_RTM is about 20 times faster than RC and about 100 times faster than MS; the advantage grows as the number of backup versions increases, and it is larger still when versions are reclaimed in batches. In space, Gc_RTM has the smallest overhead, about 1/2 that of RC and 1/3 that of MS, and this advantage also grows with the number of backup versions. In summary, Gc_RTM can effectively improve the time and space performance of garbage collection in backup systems that use data deduplication and can optimize storage performance.
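
The two existing methods described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data model (each backup version is a list of chunk fingerprints), the function names, and in particular `reference_time_gc`, which is only one possible reading of a reference time map (remember the newest version that references each chunk, and assume versions expire oldest first). The abstract does not specify the actual RTM layout, and the container bitmap is not modeled here.

```python
# Hypothetical model: version number -> chunk fingerprints it references.
Backups = dict[int, list[str]]


def reference_count_gc(backups: Backups, expired: set[int]) -> set[str]:
    """Reference count: +1 per reference at backup time, -1 per reference on expiry."""
    counts: dict[str, int] = {}
    for chunks in backups.values():
        for fp in chunks:
            counts[fp] = counts.get(fp, 0) + 1
    for version in expired:
        for fp in backups[version]:
            counts[fp] -= 1
    # A chunk whose count has dropped to 0 is garbage.
    return {fp for fp, n in counts.items() if n == 0}


def mark_and_sweep_gc(backups: Backups, expired: set[int]) -> set[str]:
    """Mark and sweep: scan the metadata of all live versions, sweep the rest."""
    stored = {fp for chunks in backups.values() for fp in chunks}
    marked = {
        fp
        for version, chunks in backups.items()
        if version not in expired
        for fp in chunks
    }
    return stored - marked


def reference_time_gc(backups: Backups, expired_up_to: int) -> set[str]:
    """Assumed reading of a reference time map, not the thesis's definition.

    Records the newest version that references each chunk; a chunk is garbage
    once that version has expired (versions are assumed to expire oldest first).
    """
    last_ref: dict[str, int] = {}
    for version in sorted(backups):
        for fp in backups[version]:
            last_ref[fp] = version  # overwrite, so repeating an update is harmless
    return {fp for fp, version in last_ref.items() if version <= expired_up_to}


if __name__ == "__main__":
    backups = {1: ["a", "b", "b"], 2: ["b", "c"], 3: ["c", "d"]}
    print(reference_count_gc(backups, {1}))
    print(mark_and_sweep_gc(backups, {1}))
    print(reference_time_gc(backups, 1))
```

In this toy model all three functions agree on which chunks are garbage. The difference lies in the work done and in the failure modes: the count-based version depends on every increment and decrement being applied exactly once, the scan-based version touches the metadata of every live version, and the time-based version only compares one stored value per chunk against the expiry point.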
Keywords/Search Tags: Deduplication, Reference Count, Mark and Sweep, Reference Time Map, Garbage Collection