
Research On Key Techniques Of Data Deduplication In The Environment Of Big Data

Posted on: 2017-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: S Long
Full Text: PDF
GTID: 2428330488971874
Subject: Computer Science and Technology
Abstract/Summary:
With the development of data-related information technology, the amount of global data is growing explosively. While big data brings us abundant benefits, it also poses a grand challenge for data storage. To improve storage efficiency, researchers have proposed deduplication techniques to reduce redundant data. Although deduplication can save a great deal of storage space, its performance, reliability, and scalability restrict its further development. This thesis therefore studies the performance of deduplication, proposes an index scheme to improve it, and further designs and implements a new deduplication system. The main research work and contributions of the thesis are as follows:

(1) To reduce the frequency of disk accesses and improve read performance, we propose a Secondary Index Assisted Read scheme (SIAR). We first classify fingerprints (the hash codes of file chunks) according to their value ranges, and then build a B-tree for each class of fingerprints. Compared with a single B-tree containing all fingerprints, the B-tree for each fingerprint class is shorter, so fewer random disk accesses are needed per lookup, which improves file read performance (a minimal sketch of this classified index follows the abstract). Further, we analyze the trade-off between SIAR's performance improvement and its memory overhead, which makes the proposed scheme adaptable to different applications. Finally, we conduct extensive experiments that confirm the efficiency and efficacy of SIAR.

(2) We design and implement a new deduplication system consisting of a client and a server. The client classifies files by type and applies an adaptive chunking algorithm to each type, reducing computational overhead while maintaining the deduplication rate (see the chunking sketch below). The client first sends each chunk's fingerprint to the server to detect whether the chunk is new, and transfers the chunk data only if it is; only in this way is bandwidth saved (see the client-server sketch below). On the server side, a Bloom filter and SIAR are used together to detect duplicate chunks. Finally, the server stores data in TFS (Taobao File System), which ensures the reliability and scalability of the deduplication system.
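The following is a minimal Python sketch of the classified-index idea behind SIAR, not the thesis's implementation: fingerprints are routed to per-class indexes by their value range, and a back-of-the-envelope height estimate shows why smaller per-class B-trees need fewer random disk accesses. The class count, B-tree fanout, and SHA-1 fingerprinting are assumptions, and plain dicts stand in for the on-disk B-trees.

```python
import hashlib
import math

NUM_CLASSES = 256      # partition fingerprints by their first byte (an assumption)
BTREE_FANOUT = 128     # assumed branching factor of each on-disk B-tree

def fingerprint(chunk: bytes) -> bytes:
    """Chunk fingerprint; SHA-1 is a common dedup choice, assumed here."""
    return hashlib.sha1(chunk).digest()

def class_of(fp: bytes) -> int:
    """Route a fingerprint to a class by its value range (here: its first byte)."""
    return fp[0]

class SIARIndex:
    """One index per fingerprint class; dicts stand in for on-disk B-trees."""
    def __init__(self):
        self.trees = [dict() for _ in range(NUM_CLASSES)]

    def insert(self, fp: bytes, location: int) -> None:
        self.trees[class_of(fp)][fp] = location

    def lookup(self, fp: bytes):
        return self.trees[class_of(fp)].get(fp)

def btree_height(n_keys: int) -> int:
    """Approximate height of a B-tree holding n_keys with the assumed fanout."""
    return max(1, math.ceil(math.log(max(n_keys, 2), BTREE_FANOUT)))

# Why classification helps: with 10**9 fingerprints, one global B-tree needs
# ceil(log_128(1e9)) = 5 levels, while each of the 256 per-class trees holds
# about 4e6 keys and needs only 4, i.e. one fewer random disk access per lookup.
print(btree_height(10**9), btree_height(10**9 // NUM_CLASSES))  # -> 5 4

idx = SIARIndex()
fp = fingerprint(b"example chunk")
idx.insert(fp, 0)
assert idx.lookup(fp) == 0
```

The memory trade-off the abstract mentions shows up here as the per-class root and upper-level nodes that must stay cached: more classes mean shallower trees but more resident index state.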
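The abstract does not specify which chunking algorithm each file type receives, so the routing table below is hypothetical. The sketch contrasts the two standard options such a client could select between: cheap fixed-size chunking for already-compressed types, and content-defined chunking (with a simple rolling sum standing in for Rabin fingerprints) for editable types, where it keeps edits from shifting every later chunk boundary.

```python
def fixed_size_chunks(data: bytes, size: int = 8192):
    """Fixed-size chunking: cheapest option, suits already-compressed types."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, mask: int = (1 << 12) - 1,
                           min_len: int = 2048, max_len: int = 16384):
    """Content-defined chunking: cut where a rolling hash matches a pattern,
    so an insertion only disturbs boundaries near the edit."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_len and (rolling & mask) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Hypothetical type-to-algorithm mapping; the thesis's actual mapping is not given.
CHUNKERS = {".txt": content_defined_chunks, ".jpg": fixed_size_chunks}

def chunk_file(name: str, data: bytes):
    """Pick a chunker by file extension, defaulting to content-defined chunking."""
    for ext, chunker in CHUNKERS.items():
        if name.endswith(ext):
            return chunker(data)
    return content_defined_chunks(data)
```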
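Below is a minimal sketch, under assumed parameters, of the fingerprint-first exchange and the server-side two-stage duplicate check (Bloom filter first, then the full index) described above. The in-memory dict stands in for both SIAR and TFS storage; the filter size, hash count, and hash functions are illustrative choices, not the thesis's.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter (bit array + k derived hash positions)."""
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

class Server:
    """Server side: the Bloom filter screens out definitely-new chunks cheaply;
    the full fingerprint index (standing in for SIAR) confirms real duplicates."""
    def __init__(self):
        self.filter = BloomFilter()
        self.index = {}  # fingerprint -> chunk data (TFS in the real system)

    def has_chunk(self, fp: bytes) -> bool:
        # A negative Bloom answer is definitive; a positive one must be verified.
        return self.filter.might_contain(fp) and fp in self.index

    def store_chunk(self, fp: bytes, data: bytes) -> None:
        self.filter.add(fp)
        self.index[fp] = data

def client_upload(server: Server, chunks) -> int:
    """Send fingerprints first; transfer chunk data only for chunks the server lacks."""
    sent = 0
    for data in chunks:
        fp = hashlib.sha1(data).digest()
        if not server.has_chunk(fp):      # fingerprint-only round trip
            server.store_chunk(fp, data)  # data transferred only for new chunks
            sent += 1
    return sent

srv = Server()
print(client_upload(srv, [b"alpha", b"beta", b"alpha"]))  # -> 2: duplicate skipped
```

Sending fingerprints before data is what saves bandwidth: a duplicate chunk costs only a fingerprint-sized message instead of a full chunk transfer.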
Keywords/Search Tags: Big Data, Data Storage, Deduplication, Chunking Index