
Design and Implementation of a Backup System Based on Data De-Duplication

Posted on: 2011-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: S X Cai
Full Text: PDF
GTID: 2178360308460899
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the development of information technology, data has become increasingly important to enterprises. Daily business operations generate large volumes of production data, and in recent years the explosive growth of such data has placed ever higher demands on the storage capacity of data centers. Statistics show that enterprise data volumes grow day by day, yet much of that data is duplicated; data de-duplication technology was proposed to address exactly this redundancy. De-duplication is now used to improve storage efficiency and performance, and it carries significant theoretical and practical value.

This thesis presents a file backup system based on data de-duplication. The system stores files compactly, saving both storage space and network bandwidth, and it retains multiple versions of the data efficiently while reducing disk overhead.

The system comprises two functional modules: a de-duplication module, which divides files into chunks and eliminates duplicates, and a performance-improvement module, which provides pre-processing and load balancing.

In the de-duplication module, the file-chunking component uses variable-length chunking to avoid sensitivity to byte shifts within a file, so that different versions of the same file are divided into largely identical sets of chunks. Duplicate detection uses a Bloom filter, which completes each de-duplication lookup in O(1) time and is faster and more efficient than the traditional database-based approach. Although a Bloom filter admits false positives, theory and experiments both show that the false positive rate remains controllable as long as the volume of processed data stays within a certain range.

In the performance-improvement module, the system defines a data structure, the directory hierarchical hash tree, and uses it to pre-process the backup directory tree by pruning, which shortens backup time. The server side employs distributed processing to keep the Bloom filter's false positive rate small, while controllers augmented with MOSS agents balance client requests to keep the service responsive.

Experimental results show that the system clearly outperforms Rsync and LBFS in both data compression ratio and bandwidth consumption.
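The abstract does not spell out the chunking algorithm, so the following Python sketch shows one common realization of variable-length (content-defined) chunking: a rolling-style hash scans the file and declares a chunk boundary wherever the hash matches a mask, so boundaries follow content rather than absolute offsets. The constants MASK, MIN_CHUNK, and MAX_CHUNK are illustrative assumptions, not values from the thesis.

    import hashlib

    # Illustrative parameters; the thesis does not state its chunking constants.
    MASK = 0x1FFF        # boundary when the low 13 hash bits are zero (~8 KiB average chunk)
    MIN_CHUNK = 2048     # suppress degenerate tiny chunks
    MAX_CHUNK = 65536    # force a boundary on data that never matches the mask

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of variable-length chunks.

        The hash is updated by shift-and-add, so after 32 shifts a byte falls
        out of the 32-bit state, giving an effective sliding window. Because a
        boundary is declared wherever the window hash matches the mask,
        inserting bytes near the start of a file disturbs only nearby chunks,
        which is exactly the shift-insensitivity the abstract describes.
        """
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF
            length = i + 1 - start
            if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
                yield start, i + 1
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data)    # trailing partial chunk

    def chunk_fingerprints(data: bytes):
        """SHA-1 fingerprint per chunk; identical chunks collide on purpose,
        which is what duplicate elimination relies on."""
        return [(hashlib.sha1(data[s:e]).hexdigest(), s, e)
                for s, e in chunk_boundaries(data)]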
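The O(1) duplicate check can likewise be sketched with a textbook Bloom filter over chunk fingerprints. The bit-array size and hash count below are assumed rather than taken from the thesis, and a production system would still need a fallback index to confirm hits, since a positive answer may be a false positive.

    import hashlib

    class BloomFilter:
        """Fixed-size bit array supporting O(1) add and membership test.

        False positives are possible (a new fingerprint may look like a
        duplicate); false negatives are not. Sizing num_bits and num_hashes
        for the expected number of fingerprints keeps the false positive
        rate controllable, as the abstract notes.
        """

        def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 7):
            self.m, self.k = num_bits, num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, item: bytes):
            # Derive k bit positions from two digests (double hashing).
            h1 = int.from_bytes(hashlib.md5(item).digest()[:8], "big")
            h2 = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item: bytes) -> None:
            for p in self._positions(item):
                self.bits[p >> 3] |= 1 << (p & 7)

        def __contains__(self, item: bytes) -> bool:
            return all(self.bits[p >> 3] & (1 << (p & 7))
                       for p in self._positions(item))

    # A chunk is stored only when its fingerprint is not (probably) present.
    seen = BloomFilter()
    fp = b"example-chunk-fingerprint"
    if fp not in seen:
        seen.add(fp)          # first sight: keep the chunk
    assert fp in seen         # second sight: treated as a duplicate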
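The abstract names the directory hierarchical hash tree but not its exact layout. One plausible reading is a Merkle-style tree in which each directory's hash summarizes everything beneath it, so any subtree whose hash is unchanged since the previous backup can be pruned from the walk. The sketch below follows that assumption and ignores symlinks, permissions, and error handling.

    import hashlib
    import os

    def build_hash_tree(path: str, tree: dict) -> str:
        """Bottom-up Merkle-style hash of a directory tree.

        Each directory's hash covers its entry names, file contents, and
        child-directory hashes, so any change beneath a directory changes
        that directory's hash. All directory hashes are recorded in `tree`.
        """
        h = hashlib.sha1()
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            h.update(name.encode())
            if os.path.isdir(full):
                h.update(build_hash_tree(full, tree).encode())
            else:
                with open(full, "rb") as f:
                    h.update(hashlib.sha1(f.read()).digest())
        tree[path] = h.hexdigest()
        return tree[path]

    def dirs_to_backup(path: str, old_tree: dict, new_tree: dict):
        """Prune the backup walk: a directory whose hash is unchanged since
        the previous backup is skipped together with its whole subtree."""
        if old_tree.get(path) == new_tree[path]:
            return                      # unchanged subtree: prune it
        yield path
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isdir(full):
                yield from dirs_to_backup(full, old_tree, new_tree)

Under this reading, a backup run would call build_hash_tree once, persist the resulting map, and on the next run walk dirs_to_backup against the saved map so that only the changed parts of the directory tree are touched.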
Keywords/Search Tags: data de-duplication, directory hierarchical hash tree, file-chunking, distributed file system