
Research On Key Technologies Of Data Deduplication For Backup System

Posted on: 2019-07-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Sun
Full Text: PDF
GTID: 1368330611493111
Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, data is growing at an unprecedented speed. The overall global data volume has already exceeded 10 EB, and a single data center can now store data at the EB level. Such rapid growth brings new challenges to storage systems, and how to protect big data effectively has become a hot research topic. As the most common means of data protection, a backup system copies the whole data set, which causes severe storage consumption. Research results show that there is a great deal of redundancy in real-world data, especially in backup systems, where data redundancy can easily reach 80%. As one of the data reduction techniques, data deduplication is an effective way to detect and delete redundant data; it improves storage efficiency and reduces the management cost of the storage system. At the same time, data deduplication shrinks the backup window of a backup system, and for remote backup systems it reduces the amount of data transferred over the network. As a result, data deduplication is widely used in backup systems.

However, there are still many challenges in adopting data deduplication in backup systems. First, the design of a deduplication system must change with the storage environment in which it is deployed, yet for backup systems there is a lack of long-term, deduplication-oriented studies of backup data sets. Second, when the data set is too large to store on a single node, cluster deduplication faces the storage-node island problem, i.e., the same data exists on different storage nodes; a reasonable data routing algorithm is needed to assign chunks so that a high deduplication ratio is maintained while the storage nodes stay balanced. Third, long-term backup leads to fragmentation, which degrades the write and restore speed of the backup system. Fourth, previous data routing algorithms process backup data as a plain data stream and ignore user information, which often leads to high network overhead.

In this thesis, we study the current trends in adopting data deduplication in backup systems. Based on an analysis of a backup data set, we propose new data routing algorithms and implement them. The main work and contributions of this thesis cover three aspects.

(1) We collected and published the FSL-Homes data set, studied and evaluated it in depth from the perspective of deduplication, and drew a number of valuable conclusions. The data set spans 2.5 years, covers 32 users and over 4,000 snapshots, and amounts to more than 450 TB with rich metadata. The analysis covers both single-node and cluster deduplication. In the single-node tests, we found that, because the deduplication ratio is already high, using a smaller chunk size may actually decrease the effective deduplication ratio, since it introduces more metadata. Whole-file chunking is inefficient in terms of deduplication ratio, mainly because the large files that occupy most of the space deduplicate poorly under whole-file chunking. When analyzing the data set from the users' perspective, we found that although the users share the same working environment, their data sets have different characteristics; for example, different users' data sets show different sensitivities of the deduplication ratio to the chunk size.
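To make the single-node comparison concrete, the following is a minimal sketch (not the thesis code) of how a deduplication ratio can be measured under fixed-size chunking versus whole-file chunking using SHA-1 fingerprints. The 8 KiB chunk size and the directory name "backup_snapshot" are illustrative assumptions, and the sketch ignores the fingerprint metadata overhead that the thesis analysis accounts for.

import hashlib
import os

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size for illustration

def fixed_size_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (SHA-1 fingerprint, length) for each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield hashlib.sha1(data).hexdigest(), len(data)

def whole_file_fingerprint(path):
    """Fingerprint the whole file as a single chunk (whole-file chunking)."""
    h, size = hashlib.sha1(), 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
            size += len(block)
    return h.hexdigest(), size

def dedup_ratio(files, chunker):
    """Logical bytes divided by unique (physically stored) bytes."""
    seen, logical, physical = set(), 0, 0
    for path in files:
        for fp, length in chunker(path):
            logical += length
            if fp not in seen:
                seen.add(fp)
                physical += length
    return logical / physical if physical else 1.0

if __name__ == "__main__":
    files = [os.path.join(root, name)
             for root, _, names in os.walk("backup_snapshot")  # hypothetical snapshot directory
             for name in names]
    print("fixed-size chunking :", dedup_ratio(files, fixed_size_chunks))
    print("whole-file chunking :", dedup_ratio(files, lambda p: [whole_file_fingerprint(p)]))

Running both chunkers over the same snapshot directory reproduces, in miniature, the kind of comparison reported above: whole-file chunking misses the redundancy inside large, slightly modified files that fixed-size chunking can still detect.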
In the cluster deduplication analysis, we first implemented seven typical data routing algorithms and then compared their cluster deduplication ratio, load distribution, and network overhead on FSL-Homes. The results show that using the file as the routing unit can improve the deduplication ratio, but it also leads to data skew among the storage nodes, mainly because different file types vary significantly in size. Using a larger chunk size improves the overall performance of data routing, because both the routing and the deduplication overhead decrease while the influence on the deduplication ratio is negligible. Because different chunks appear with different frequencies, the logical distribution is not always the same as the physical distribution. Through the analysis of the FSL-Homes data set, we derived many valuable suggestions for the design of deduplication strategies in future backup systems.

(2) We propose a user-information-based cluster deduplication architecture and optimize the system from the perspectives of data routing and hash indexing. The data routing algorithm is the first to use user information to direct data routing. By analyzing the FSL-Homes data set, we found that users show a clear grouping tendency: users in the same group overlap strongly, while the redundancy among users in different groups is negligible. To group users efficiently, we studied the data shared among users in the same group and found that these chunks appear much more frequently than other chunks, and that the data shared between different users in the same group is similar. Based on this observation, we designed a user-information-based data routing algorithm: we build a hot-chunk index on each storage node and compute the similarity between a super-chunk and this index. By exploiting the similarity between users, the algorithm routes similar users' chunks to the same storage node, which improves the cluster deduplication ratio even with a large super-chunk size. For hash indexing, we exploit the similarity of consecutive snapshots and design a file-recipe-based index algorithm. The evaluation results show that the deduplication ratio is higher than that of most routing algorithms, the network overhead is significantly lower than that of other stateful algorithms, and the write speed is higher.

(3) We propose DS-Dedup, a highly scalable data routing algorithm with low network overhead for data sets that contain no user information. DS-Dedup builds a super-chunk-level similarity index on each client to exploit the similarity of the backup stream. For a new super-chunk, if the similarity between its handprint and the similarity index reaches a threshold and is the highest among all storage nodes, the super-chunk is routed to that storage node directly, without first sending the whole super-chunk to all candidate nodes; in this way the network overhead of data routing is avoided. Super-chunks whose similarity is below the threshold are routed by consistent hashing, which keeps the system scalable. The evaluation results show that DS-Dedup significantly reduces the network overhead while maintaining a high cluster deduplication ratio.

Through the above research on data deduplication techniques for backup systems in the big data environment, we provide strong technical support for the efficient adoption of data deduplication in backup systems.
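The following is a minimal sketch, not the thesis implementation, of the DS-Dedup routing idea described above: the client keeps a per-storage-node record of handprints it has routed, sends a new super-chunk to the most similar node when the similarity exceeds a threshold, and otherwise falls back to consistent hashing. The handprint size (the k smallest fingerprints), the threshold value, and the similarity measure are illustrative assumptions.

import hashlib
from bisect import bisect_right

HANDPRINT_K = 8      # assumed number of representative fingerprints per super-chunk
SIM_THRESHOLD = 0.5  # assumed routing threshold

def handprint(chunk_fingerprints, k=HANDPRINT_K):
    """Representative sample of a super-chunk: its k smallest chunk fingerprints."""
    return frozenset(sorted(chunk_fingerprints)[:k])

class ConsistentHashRing:
    """Plain consistent hashing over storage node ids (the stateless fallback path)."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (int(hashlib.sha1(f"{n}:{v}".encode()).hexdigest(), 16), n)
            for n in nodes for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def route(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.ring[bisect_right(self.keys, h) % len(self.ring)][1]

class DSDedupRouter:
    """Client-side router: similarity-directed when confident, consistent hashing otherwise."""
    def __init__(self, nodes):
        self.fallback = ConsistentHashRing(nodes)
        self.index = {n: set() for n in nodes}  # fingerprints already routed to each node

    def route(self, chunk_fingerprints):
        hp = handprint(chunk_fingerprints)
        # Similarity = fraction of the handprint already recorded for each node.
        best_node, best_sim = max(
            ((n, len(hp & fps) / len(hp)) for n, fps in self.index.items()),
            key=lambda t: t[1])
        if best_sim >= SIM_THRESHOLD:
            target = best_node                     # stateful, similarity-directed routing
        else:
            target = self.fallback.route(min(hp))  # stateless consistent hashing
        self.index[target].update(hp)              # remember where this super-chunk went
        return target

Because only the small handprint is compared against the client-side index, the decision is made without shipping the whole super-chunk to every candidate node, which is the source of the network-overhead savings described above.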
Keywords/Search Tags:Big Data, Backup Systems, Data Deduplication, Data-set Analysis, Cluster Deduplication, Hash-Indexing, Data Routing Algorithm