
High Performance Data Deduplication Mechanisms For Data Centers

Posted on: 2020-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2428330599951316
Subject: Engineering
Abstract/Summary:
Deduplication is an effective way to remove redundant data; in large-scale storage it saves space and reduces storage overhead. As data centers have developed, deduplication has received wide attention and broad deployment. Today's data centers are characterized by distributed architectures, massive scale, and high data redundancy, which pose new challenges for deduplication and make improving throughput through highly concurrent deduplication an urgent need.

Because data centers ingest many streams at once, deduplication based on a sorted index has been studied as a starting point. A sorted index is easy to scale out and highly parallel, so it can greatly improve throughput. Our analysis identifies two problems in multi-stream deduplication over a sorted index: 1. uneven resource allocation among multiple clients can degrade the performance of an individual client; 2. running many streams in parallel scatters their fingerprints, destroying the locality of each data stream and thus hurting deduplication performance. We propose algorithms that effectively solve these problems and further optimize the performance of sorted-index deduplication. The specific research contributions are as follows:

1) A scheduling algorithm for fingerprint lookup based on the fingerprint-page distribution. First, we obtain through experiments information about the fingerprint pages each stream reads. Analyzing this information, we divide fingerprint-page distributions into four categories, each with a different delay. We then design a classifier that predicts, before lookup, the distribution category of each stream from its fingerprint batch size. According to the predicted category, each stream is assigned a priority; the priority determines which fingerprint page to read next, replacing sequential page reads with on-demand reads to improve throughput (a minimal sketch appears after this abstract). Extensive experiments on a real data set demonstrate the effectiveness of the algorithm: it preserves the overall performance of the parallel streams while also improving the long-latency streams of individual clients.

2) A density-based multi-stream parallel detection algorithm. First, by computing fingerprint differences, we find the fingerprint-intensive area of each individual stream. The intensive areas of all streams form a set, from which we derive the common fingerprint-intensive area. In each round, only the fingerprints that fall in the common area are looked up; the remaining fingerprints are carried over and batched with newly arriving fingerprints for the next round (a sketch also follows below). Counting the fingerprint pages read during deduplication shows that the algorithm reduces repeated page reads and thereby greatly improves deduplication throughput.
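The following is a minimal, hypothetical sketch of the distribution-aware scheduler in contribution 1. The four category names, the batch-size thresholds in the toy classifier, the fingerprints-per-page constant, and all function names are assumptions made for illustration; the thesis derives its categories and their delays experimentally, and this abstract does not specify them.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

FPS_PER_PAGE = 512  # assumed number of fingerprints per sorted-index page

# Hypothetical category names, ranked by expected page-read delay
# (lower rank = shorter expected delay = served first).
CATEGORY_RANK = {"dense": 0, "clustered": 1, "scattered": 2, "sparse": 3}

def classify(batch_size: int) -> str:
    """Toy classifier: predict a stream's fingerprint-page distribution
    from its fingerprint batch size (thresholds are assumptions)."""
    if batch_size >= 4096:
        return "dense"
    if batch_size >= 1024:
        return "clustered"
    if batch_size >= 256:
        return "scattered"
    return "sparse"

@dataclass(order=True)
class StreamRequest:
    priority: int                            # rank of the predicted category
    stream_id: int = field(compare=False)
    pages: List[int] = field(compare=False)  # index pages still demanded

def schedule(streams: Dict[int, List[int]]) -> List[Tuple[int, int]]:
    """Serve fingerprint-page reads on demand, highest-priority stream
    first, instead of scanning the sorted index sequentially. `streams`
    maps a stream id to its batch of integer fingerprints."""
    heap = []
    for sid, fps in streams.items():
        pages = sorted({fp // FPS_PER_PAGE for fp in fps})
        heap.append(StreamRequest(CATEGORY_RANK[classify(len(fps))], sid, pages))
    heapq.heapify(heap)
    reads = []                               # (stream_id, page) in service order
    while heap:
        req = heapq.heappop(heap)
        reads.append((req.stream_id, req.pages.pop(0)))
        if req.pages:                        # stream still demands more pages
            heapq.heappush(heap, req)
    return reads

# Example: the large batch is predicted "dense" and its pages are served first.
print(schedule({1: list(range(5000)), 2: list(range(100, 400))})[:3])
```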
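Below is a similarly minimal sketch of the density-based detection in contribution 2, under simplifying assumptions not stated in the abstract: fingerprints are modeled as integers in sorted-index order, an "intensive area" is taken to be a maximal run of sorted fingerprints whose neighbouring differences stay below a gap threshold, and the common area is taken to be the intersection of the per-stream areas. GAP, the round structure, and all names are illustrative.

```python
from typing import List, Tuple

Interval = Tuple[int, int]
GAP = 64  # assumed density threshold on neighbouring fingerprint differences

def intensive_areas(fps: List[int]) -> List[Interval]:
    """Fingerprint-intensive areas of one stream: maximal runs of sorted
    fingerprints whose consecutive differences stay below GAP."""
    if not fps:
        return []
    fps = sorted(fps)
    areas, start = [], fps[0]
    for prev, cur in zip(fps, fps[1:]):
        if cur - prev > GAP:                 # gap too wide: close the run
            areas.append((start, prev))
            start = cur
    areas.append((start, fps[-1]))
    return areas

def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Two-pointer intersection of two sorted, disjoint interval lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def split_round(streams: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Check now only fingerprints inside the common intensive area;
    defer the rest to the next round together with new fingerprints."""
    common = intensive_areas(streams[0])
    for fps in streams[1:]:
        common = intersect(common, intensive_areas(fps))
    in_common = lambda fp: any(lo <= fp <= hi for lo, hi in common)
    check, defer = [], []
    for fps in streams:
        for fp in fps:
            (check if in_common(fp) else defer).append(fp)
    return check, defer

# Example: only fingerprints in the shared dense regions are checked now.
check, defer = split_round([[1, 2, 3, 900, 905, 2000], [2, 4, 903, 906, 5000]])
print(check, defer)
```

Deferring the scattered fingerprints keeps each round's lookups confined to a few hot index pages, which is how repeated page reads are avoided.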
Keywords/Search Tags: deduplication, multi-stream, concurrency, sorted index