
Research On Duplicate Data Detection In Data Deduplication

Posted on: 2018-06-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P F Zhang
Full Text: PDF
GTID: 1318330515983382
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology, the data generated by various applications shows explosive growth, which brings great challenges to data storage and backup. As a global redundant-data elimination technology, deduplication has been extensively studied by academia and the storage industry, and it has become one of the most critical technologies in backup systems. Compared with traditional data compression, deduplication focuses on finding redundant data across the entire system and then identifies and eliminates it at a coarse-grained level, which makes it more efficient than compression at identifying and removing globally redundant data. Deduplication not only greatly reduces the storage-medium and energy costs of data storage, but also reduces the required network bandwidth and improves the speed and efficiency of data transmission. Data may be deployed in a variety of scenarios, and because of the diversity of data types, designing an efficient deduplication system faces many challenges. In particular, for duplicate data detection, the deduplication ratio and detection performance directly affect the overall performance and effectiveness of deduplication. Because deduplication is used in different application scenarios whose data-set characteristics differ greatly, no single duplicate data detection algorithm can be optimal for all types of data sets. This paper therefore focuses on duplicate data detection algorithms for these issues.

Based on redundancy, data can be roughly classified into two categories. The first category is highly redundant data sets, mainly generated by backups, snapshots, archives, and so on. In such data sets the data usually exhibit strong similarity and locality: some data sets, such as incremental backups, exhibit strong locality, while others, such as full backups and snapshots, exhibit strong similarity. The second category is conventional data sets with relatively low redundancy, which contain little duplicate data. Based on this classification, this paper proposes three different algorithms that optimize the retrieval of duplicate data according to the characteristics of different data sets.

For data sets with strong locality, this paper proposes HsDedup. It uses a Bloom filter, a hash table, and several caching mechanisms to fully exploit the temporal and spatial locality of the data, improving both the retrieval efficiency of duplicate fingerprints and the accuracy of fingerprint prefetching. Specifically, for potentially duplicate data blocks in the data stream, HsDedup first uses the Bloom filter to predict whether a block is a duplicate; depending on the prediction and other conditions, it then checks the write buffer, the hot and cold zones of the cache, and finally the disk for duplicate data detection. By exploiting data locality, it improves the effectiveness of duplicate data detection. Experiments show that HsDedup outperforms existing solutions in duplicate data retrieval.
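To make the tiered lookup concrete, the following is a minimal Python sketch of an HsDedup-style fingerprint lookup path. The dissertation does not publish code, so the class names, zone layout, and parameters here are assumptions for illustration only: a Bloom filter first predicts whether a fingerprint may be a duplicate, and only then are the write buffer, the hot and cold cache zones, and finally the on-disk index consulted in order.

```python
# A minimal, illustrative sketch (not the dissertation's implementation) of an
# HsDedup-style tiered fingerprint lookup.  A Bloom filter first predicts
# whether a chunk fingerprint may be a duplicate; only then are the write
# buffer, the hot cache zone, the cold cache zone, and finally the on-disk
# index consulted.  Class and method names are assumptions.
import hashlib
from collections import OrderedDict


class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(item + bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class TieredFingerprintIndex:
    """Buffer -> hot zone -> cold zone -> disk, guarded by a Bloom filter.

    Eviction, promotion between zones, and locality-based prefetching of
    neighbouring fingerprints are omitted for brevity.
    """

    def __init__(self):
        self.bloom = BloomFilter()
        self.buffer = {}           # fingerprints of the segment being written
        self.hot = OrderedDict()   # recently matched fingerprints
        self.cold = OrderedDict()  # prefetched or aged-out fingerprints
        self.disk_index = {}       # stand-in for the on-disk fingerprint index

    def lookup(self, fp: bytes):
        # A negative Bloom-filter answer is definite: the block is new,
        # so every further (and more expensive) lookup is skipped.
        if not self.bloom.might_contain(fp):
            return None
        for tier in (self.buffer, self.hot, self.cold):
            if fp in tier:
                return tier[fp]
        # Possible false positive or cache miss: fall back to the disk index.
        return self.disk_index.get(fp)

    def insert(self, fp: bytes, location):
        self.bloom.add(fp)
        self.buffer[fp] = location
        self.disk_index[fp] = location
```

In this sketch, a negative Bloom-filter answer short-circuits the lookup entirely, and the in-memory cache zones exploit locality so that positive lookups rarely have to reach the slow on-disk index.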
For data sets with strong similarity, this paper proposes an efficient fingerprint retrieval scheme, RMD. RMD uses a Bloom filter array and a data-similarity algorithm to effectively narrow the scope of fingerprint retrieval. Specifically, RMD uses data-similarity theory to quickly locate similar data segments and uses fingerprint bins to absorb and aggregate the fingerprints of similar segments. The fingerprint samples in the bins continue to accumulate, which improves deduplication efficiency while reducing the disk access frequency and improving fingerprint retrieval performance. Experiments show that this scheme achieves a deduplication efficiency similar to existing schemes, but with much higher fingerprint retrieval performance.

For data sets lacking both locality and similarity, this paper proposes a strategy that streamlines the fingerprint index table and applies it to resource scheduling in an inline deduplication system. Specifically, by prioritizing the retrieval of high-frequency fingerprints at the source, it reduces the amount of data transmitted online. In the inline deduplication system, low-frequency fingerprints are transferred to post-deduplication servers, while high-frequency fingerprints are retained. We designed an online deduplication scheme and evaluated its fingerprint clean-up process. The results show that by migrating low-frequency fingerprints out of the inline deduplication system, 80% of the fingerprints in the fingerprint index table can be removed, improving overall performance by two to three times. Meanwhile, after the low-frequency fingerprints are removed, the relative deduplication ratio of inline deduplication remains above 92%; the remaining 8% (the missed duplicate data) can be detected by the post-deduplication servers.
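The following is a minimal Python sketch, under assumed names and a hypothetical frequency threshold, of the frequency-based index streamlining described above: fingerprints whose match count stays below the threshold are evicted from the inline index and handed off to post-deduplication servers, while high-frequency fingerprints remain resident so that most duplicates are still detected online.

```python
# A minimal sketch, under assumed names, of frequency-based index streamlining.
# Low-frequency fingerprints are evicted from the inline (source-side) index
# and handed to post-deduplication servers, while high-frequency fingerprints
# stay resident so that most duplicates are still detected online.
from collections import Counter


class FrequencyBasedInlineIndex:
    def __init__(self, low_freq_threshold=2):
        self.threshold = low_freq_threshold  # hypothetical cut-off
        self.index = {}        # fingerprint -> stored location
        self.freq = Counter()  # how often each fingerprint has been matched

    def lookup(self, fp):
        """Return the stored location if fp is a known (high-frequency) duplicate."""
        if fp in self.index:
            self.freq[fp] += 1
            return self.index[fp]
        return None  # treated as new data; caught later by post-deduplication

    def insert(self, fp, location):
        self.index[fp] = location
        self.freq[fp] += 1

    def offload_low_frequency(self):
        """Remove low-frequency fingerprints from the inline index.

        The returned list stands in for the fingerprints shipped to the
        post-deduplication servers.
        """
        cold = [fp for fp, count in self.freq.items() if count < self.threshold]
        for fp in cold:
            self.index.pop(fp, None)
            del self.freq[fp]
        return cold
```

The design trade-off sketched here mirrors the evaluation reported above: shrinking the inline index speeds up online fingerprint lookups, while the duplicates missed because of evicted low-frequency fingerprints are recovered later by the post-deduplication servers.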
Keywords/Search Tags: Deduplication, Redundancy Elimination, Key-Value System, Data Locality, Frequency-based Deduplication, Data Backup