
HTDRDedu: The Design and Implementation of a Distributed Backup Data Deduplication System

Posted on: 2017-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: M Yao
Full Text: PDF
GTID: 2348330512499484
Subject: Computer technology
Abstract/Summary:
Data backup is an important means of ensuring data security; individuals, organizations, and companies alike need to protect their data. In recent years, backup systems based on data deduplication have markedly reduced storage overhead, but the arrival of the big-data era, with rapidly growing global data volumes, confronts traditional deduplication systems with new challenges. On the one hand, big data demands ever more space, driving up storage costs; on the other hand, massive data volumes place higher demands on processing speed: data must be processed in as little time as possible, which requires greater data throughput.

To address these challenges, this paper proposes HTDRDedu, a distributed backup data deduplication system based on data routing, comprising two parts: file management and deduplication storage. File management is responsible for the interaction between the system and users' files. It exposes interfaces for file operations, including submission and retrieval of a file, and for each submitted file it performs data chunking, chunk fingerprint computation (hash calculation), chunk-group transmission, data routing, and file metadata management. To improve processing efficiency, file management also performs a first round of deduplication, using techniques such as Rabin fingerprints and Bloom filters.

Deduplication storage performs the final check on data chunks, ensuring that duplicates are removed. It also prefetches chunk fingerprints, using two strategies: average sampling, which supplies data for the file-management layer's routing decisions, and neighbor sampling, which supplies data for deduplication at the file-management layer. Finally, deduplication storage is responsible for recovering users' files: a user retrieves and parses the file's metadata, then fetches each chunk group of the file from the corresponding deduplication storage node to restore it.

A Java-based prototype of HTDRDedu was implemented and applied to virtual machine disk files, and its deduplication ratio and data throughput were measured. Experimental results show that the system's throughput improves significantly compared with querying all processing nodes and with fixed data routing, while maintaining a good deduplication ratio.
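The abstract names Rabin fingerprinting for chunk-boundary detection but gives no parameters. Below is a minimal Java sketch of content-defined chunking with SHA-1 chunk fingerprints; the rolling hash is a simplified stand-in for a true sliding-window Rabin fingerprint, and the size bounds and boundary mask are illustrative assumptions, not values from the thesis.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Sketch: content-defined chunking with a simple polynomial hash as a
// stand-in for a full Rabin fingerprint, plus SHA-1 chunk fingerprints.
// MIN_SIZE, MAX_SIZE, and BOUNDARY_MASK are illustrative assumptions.
public class Chunker {
    private static final int MIN_SIZE = 2 * 1024;      // 2 KiB lower bound
    private static final int MAX_SIZE = 64 * 1024;     // 64 KiB upper bound
    private static final long BOUNDARY_MASK = 0x1FFF;  // ~8 KiB average chunk
    private static final long PRIME = 31;

    public static List<byte[]> chunkFingerprints(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        List<byte[]> fingerprints = new ArrayList<>();
        ByteArrayOutputStream chunk = new ByteArrayOutputStream();
        long hash = 0;
        int b;
        while ((b = in.read()) != -1) {
            chunk.write(b);
            hash = hash * PRIME + (b & 0xFF);  // content-dependent hash
            boolean atBoundary = chunk.size() >= MIN_SIZE
                    && (hash & BOUNDARY_MASK) == 0;
            if (atBoundary || chunk.size() >= MAX_SIZE) {
                fingerprints.add(sha1.digest(chunk.toByteArray()));
                chunk.reset();
                hash = 0;
            }
        }
        if (chunk.size() > 0) {  // trailing partial chunk
            fingerprints.add(sha1.digest(chunk.toByteArray()));
        }
        return fingerprints;
    }
}
```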
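The first-round duplicate test at the file-management layer uses a Bloom filter. A minimal sketch follows, assuming a double-hashing scheme over the chunk fingerprint; the actual sizing and hash functions are not specified in the abstract.

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch: a Bloom filter the file-management layer could use to cheaply
// test whether a chunk fingerprint has possibly been seen before.
public class FingerprintBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public FingerprintBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from the fingerprint via double hashing.
    private int position(byte[] fingerprint, int i) {
        int h1 = Arrays.hashCode(fingerprint);
        int h2 = (h1 * 0x9E3779B1) | 1;  // second hash, forced non-zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(byte[] fingerprint) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(fingerprint, i));
        }
    }

    // false => the chunk is definitely new (no index lookup needed);
    // true  => possibly a duplicate, so fall back to the full index check.
    public boolean mightContain(byte[] fingerprint) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(fingerprint, i))) {
                return false;
            }
        }
        return true;
    }
}
```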
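The abstract does not spell out the routing rule, only that chunk groups are routed using prefetched fingerprint samples. One common stateless approach, shown here purely as an illustration, routes a group by a representative (minimum) fingerprint taken modulo the node count:

```java
import java.math.BigInteger;
import java.util.List;

// Sketch: stateless routing of a chunk group to one of N deduplication
// storage nodes, using the group's minimum fingerprint as representative.
public class ChunkGroupRouter {
    private final int numNodes;

    public ChunkGroupRouter(int numNodes) {
        this.numNodes = numNodes;
    }

    public int route(List<byte[]> groupFingerprints) {
        if (groupFingerprints.isEmpty()) {
            throw new IllegalArgumentException("empty chunk group");
        }
        byte[] representative = null;
        for (byte[] fp : groupFingerprints) {
            if (representative == null
                    || compareUnsigned(fp, representative) < 0) {
                representative = fp;  // keep the smallest fingerprint
            }
        }
        // Map the representative fingerprint onto a node index.
        return new BigInteger(1, representative)
                .mod(BigInteger.valueOf(numNodes))
                .intValue();
    }

    private static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Routing all chunks of a group to the same node keeps the duplicate check local to one node, which is why a stable, content-derived representative is used rather than a per-chunk decision.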
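The two prefetching strategies can be pictured as follows; the stride and window parameters are assumptions for illustration, not values from the thesis:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two fingerprint-prefetching strategies named in the abstract.
public class FingerprintPrefetcher {

    // Average sampling: take every stride-th fingerprint from a node's
    // index, giving the file-management layer a coarse view for routing.
    public static List<byte[]> averageSample(List<byte[]> index, int stride) {
        List<byte[]> sample = new ArrayList<>();
        for (int i = 0; i < index.size(); i += stride) {
            sample.add(index.get(i));
        }
        return sample;
    }

    // Neighbor sampling: after a fingerprint at position `hit` matches,
    // prefetch the surrounding window, exploiting the locality of backup
    // streams (chunks stored together tend to recur together).
    public static List<byte[]> neighborSample(List<byte[]> index,
                                              int hit, int window) {
        int from = Math.max(0, hit - window);
        int to = Math.min(index.size(), hit + window + 1);
        return new ArrayList<>(index.subList(from, to));
    }
}
```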
Keywords/Search Tags: storage space, data deduplication, data throughput, file management, data deduplication storage