
Design and Implementation of a Backup System Based on Data De-Duplication

Posted on: 2011-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: S X Cai
Full Text: PDF
GTID: 2178360308460899
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the development of information technology, data has become increasingly important to enterprises. Daily business operations generate large volumes of production data, and in recent years the explosive growth of such data has placed ever higher demands on the storage capacity of data centers. Statistics show that enterprise data volumes grow day by day, yet much of that data is duplicated; data de-duplication technology was proposed to address exactly this redundancy. De-duplication is now used to improve storage efficiency and performance, and it carries significant theoretical and practical value.

This thesis presents a file backup system based on data de-duplication. The system stores files compactly, saving both storage space and network bandwidth, and it retains multiple versions of the data efficiently while reducing disk overhead.

The system comprises two functional modules: a de-duplication module, which divides files into chunks and eliminates duplicates, and a performance-improvement module, which provides pre-processing and load balancing.

In the de-duplication module, the file-chunking component uses variable-length chunking to avoid sensitivity to byte shifts within a file, so that different versions of the same file are divided into largely identical sets of chunks. Duplicate detection uses a Bloom filter, which completes each de-duplication lookup in O(1) time and is faster and more efficient than the traditional database-based approach. Although a Bloom filter admits false positives, theory and experiments both show that the false positive rate remains controllable as long as the volume of processed data stays within a certain range.

In the performance-improvement module, the system defines a data structure, the directory hierarchical hash tree, and uses it to pre-process the backup directory tree by pruning, which shortens backup time. The server side employs distributed processing to keep the Bloom filter's false positive rate small, while controllers augmented with MOSS agents balance client requests to keep the service responsive.

Experimental results show that the system clearly outperforms Rsync and LBFS in both data compression ratio and bandwidth consumption.
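The abstract does not spell out the chunking algorithm, so the following Python sketch shows one common realization of variable-length (content-defined) chunking: a rolling-style hash scans the file and declares a chunk boundary wherever the hash matches a mask, so boundaries follow content rather than absolute offsets. The constants MASK, MIN_CHUNK, and MAX_CHUNK are illustrative assumptions, not values from the thesis.

    import hashlib

    # Illustrative parameters; the thesis does not state its chunking constants.
    MASK = 0x1FFF        # boundary when the low 13 hash bits are zero (~8 KiB average chunk)
    MIN_CHUNK = 2048     # suppress degenerate tiny chunks
    MAX_CHUNK = 65536    # force a boundary on data that never matches the mask

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of variable-length chunks.

        The hash is updated by shift-and-add, so after 32 shifts a byte falls
        out of the 32-bit state, giving an effective sliding window. Because a
        boundary is declared wherever the window hash matches the mask,
        inserting bytes near the start of a file disturbs only nearby chunks,
        which is exactly the shift-insensitivity the abstract describes.
        """
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF
            length = i + 1 - start
            if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
                yield start, i + 1
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data)    # trailing partial chunk

    def chunk_fingerprints(data: bytes):
        """SHA-1 fingerprint per chunk; identical chunks collide on purpose,
        which is what duplicate elimination relies on."""
        return [(hashlib.sha1(data[s:e]).hexdigest(), s, e)
                for s, e in chunk_boundaries(data)]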
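The O(1) duplicate check can likewise be sketched with a textbook Bloom filter over chunk fingerprints. The bit-array size and hash count below are assumed rather than taken from the thesis, and a production system would still need a fallback index to confirm hits, since a positive answer may be a false positive.

    import hashlib

    class BloomFilter:
        """Fixed-size bit array supporting O(1) add and membership test.

        False positives are possible (a new fingerprint may look like a
        duplicate); false negatives are not. Sizing num_bits and num_hashes
        for the expected number of fingerprints keeps the false positive
        rate controllable, as the abstract notes.
        """

        def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 7):
            self.m, self.k = num_bits, num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, item: bytes):
            # Derive k bit positions from two digests (double hashing).
            h1 = int.from_bytes(hashlib.md5(item).digest()[:8], "big")
            h2 = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item: bytes) -> None:
            for p in self._positions(item):
                self.bits[p >> 3] |= 1 << (p & 7)

        def __contains__(self, item: bytes) -> bool:
            return all(self.bits[p >> 3] & (1 << (p & 7))
                       for p in self._positions(item))

    # A chunk is stored only when its fingerprint is not (probably) present.
    seen = BloomFilter()
    fp = b"example-chunk-fingerprint"
    if fp not in seen:
        seen.add(fp)          # first sight: keep the chunk
    assert fp in seen         # second sight: treated as a duplicate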
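The abstract names the directory hierarchical hash tree but not its exact layout. One plausible reading is a Merkle-style tree in which each directory's hash summarizes everything beneath it, so any subtree whose hash is unchanged since the previous backup can be pruned from the walk. The sketch below follows that assumption and ignores symlinks, permissions, and error handling.

    import hashlib
    import os

    def build_hash_tree(path: str, tree: dict) -> str:
        """Bottom-up Merkle-style hash of a directory tree.

        Each directory's hash covers its entry names, file contents, and
        child-directory hashes, so any change beneath a directory changes
        that directory's hash. All directory hashes are recorded in `tree`.
        """
        h = hashlib.sha1()
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            h.update(name.encode())
            if os.path.isdir(full):
                h.update(build_hash_tree(full, tree).encode())
            else:
                with open(full, "rb") as f:
                    h.update(hashlib.sha1(f.read()).digest())
        tree[path] = h.hexdigest()
        return tree[path]

    def dirs_to_backup(path: str, old_tree: dict, new_tree: dict):
        """Prune the backup walk: a directory whose hash is unchanged since
        the previous backup is skipped together with its whole subtree."""
        if old_tree.get(path) == new_tree[path]:
            return                      # unchanged subtree: prune it
        yield path
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isdir(full):
                yield from dirs_to_backup(full, old_tree, new_tree)

Under this reading, a backup run would call build_hash_tree once, persist the resulting map, and on the next run walk dirs_to_backup against the saved map so that only the changed parts of the directory tree are touched.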
Keywords/Search Tags: data de-duplication, directory hierarchical hash tree, file-chunking, distributed file system