
Research And Application On De-duplication Technology For Cloud Storage

Posted on: 2016-08-02
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2428330473964988
Subject: Computer technology
Abstract/Summary:
With the continuous development of information technology, data volumes are growing so quickly that the scale of data storage far exceeds the growth of physical storage hardware, which puts tremendous pressure on storage service centers. Studies show that massive amounts of duplicate data exist in large-scale storage systems. De-duplication technology detects this redundant data through fingerprint indexing and can reduce the required physical storage capacity to a small fraction of its original size, greatly improving the utilization of physical storage devices and the efficiency of the storage system. It is therefore an effective way to address this problem. However, massive data storage produces an enormous number of fingerprints. If the storage system removes duplicate data purely by consulting the full fingerprint index, the cost of indexing outweighs its benefits and degrades the performance of the storage system. Fast fingerprint indexing is therefore the key factor affecting the application and development of de-duplication.

This dissertation targets de-duplication technology suited to a clustered storage environment. It studies and analyzes the key technologies of current de-duplication schemes and, motivated by the index-performance bottleneck that de-duplication introduces, proposes Sampling De-duplication Based on Similarity (SDBS). The algorithm focuses on the cloud storage environment and aims to improve overall system performance. While largely preserving the de-duplication ratio, SDBS reduces the range and number of fingerprint lookups through file-level sampling. To address the concerns that sampling-based de-duplication may lower the system's de-duplication ratio and overload the master node, SDBS performs a deeper duplicate-removal pass once the file similarity reaches a given threshold, thereby maintaining a high de-duplication ratio, and it assigns the duplicate-checking task to other storage nodes so that the system sustains a higher throughput.

Finally, this dissertation designs and implements a prototype system (HDFS_SDBS) based on the SDBS algorithm in an HDFS environment, and presents the detailed use cases, modules, and algorithm flow of the design and implementation. The results show that file-level sampling allows the SDBS algorithm to identify duplicate data faster and to improve the overall throughput of the system. In addition, by distributing fingerprints to multiple nodes in parallel, SDBS further improves the de-duplication effect and achieves a better de-duplication ratio. The algorithm can thus effectively resolve the fingerprint-indexing bottleneck of de-duplication in a cloud storage environment and greatly improve the storage efficiency of the system.
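To make the sampling and similarity idea concrete, the following is a minimal, illustrative Python sketch rather than code from the thesis: chunk fingerprints are sampled by hash value, the master node keeps only the sampled fingerprints of each stored file, and a full chunk-level comparison is triggered only when the sampled similarity against an existing file reaches a threshold. The function and variable names, the fixed-size chunking, the sample rate, and the 0.5 threshold are all assumptions made for illustration.

import hashlib
from collections import defaultdict

CHUNK_SIZE = 4096            # fixed-size chunking, for illustration only (assumption)
SAMPLE_RATE = 8              # keep roughly 1 out of every 8 chunk fingerprints (assumption)
SIMILARITY_THRESHOLD = 0.5   # hypothetical threshold that triggers deep de-duplication

def chunk_fingerprints(data):
    # Split the file into fixed-size chunks and fingerprint each chunk with SHA-1.
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def sample(fingerprints):
    # File-level sampling: keep only fingerprints falling in 1/SAMPLE_RATE of the hash space.
    return {fp for fp in fingerprints if int(fp, 16) % SAMPLE_RATE == 0}

class SampledIndex:
    # Master-node index that stores only the sampled fingerprints of each stored file.
    def __init__(self):
        self.fp_to_files = defaultdict(set)   # sampled fingerprint -> ids of files containing it

    def most_similar(self, incoming_sample):
        # Return (file_id, similarity) of the best candidate, or (None, 0.0) if nothing matches.
        hits = defaultdict(int)
        for fp in incoming_sample:
            for fid in self.fp_to_files.get(fp, ()):
                hits[fid] += 1
        if not hits:
            return None, 0.0
        fid, count = max(hits.items(), key=lambda kv: kv[1])
        return fid, count / max(len(incoming_sample), 1)

    def insert(self, file_id, incoming_sample):
        for fp in incoming_sample:
            self.fp_to_files[fp].add(file_id)

def store(index, full_fps_per_file, file_id, data):
    # Write-path sketch: sample, find the most similar stored file, and run the full
    # chunk-level check only when the similarity reaches the threshold. In the thesis
    # design that full check is delegated to the storage node holding the candidate
    # file; here it is done locally for brevity.
    fps = chunk_fingerprints(data)
    incoming_sample = sample(fps)
    candidate, similarity = index.most_similar(incoming_sample)
    if candidate is not None and similarity >= SIMILARITY_THRESHOLD:
        known = full_fps_per_file.get(candidate, set())
        unique = [fp for fp in fps if fp not in known]
    else:
        unique = fps
    index.insert(file_id, incoming_sample)
    full_fps_per_file[file_id] = set(fps)
    return unique                              # chunks that actually need to be written

A production version would more likely use content-defined chunking and keep each file's full fingerprint set on the individual storage nodes rather than in a single local dictionary, which is what allows the master node to hold only the much smaller sampled index.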
Keywords/Search Tags: cloud storage, de-duplication, index, similarity, sample, HDFS