
Research On Key Techniques Of Data Deduplication In The Environment Of Big Data

Posted on: 2017-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: S Long
Full Text: PDF
GTID: 2428330488971874
Subject: Computer Science and Technology
Abstract/Summary:
With the development of data-related information technology, the amount of global data is growing explosively. While big data brings us abundant benefits, it also poses a grand challenge for data storage. To improve storage efficiency, researchers have proposed deduplication techniques to reduce redundant data. Although deduplication can save a great deal of storage space, its performance, reliability, and scalability restrict its further development. This thesis therefore studies the performance of deduplication, proposes an index scheme to improve it, and further designs and implements a new deduplication system. The main research work and contributions of the thesis are as follows:

(1) To reduce the frequency of disk accesses and improve read performance, we propose a Secondary Index Assisted Read scheme (SIAR). We first classify fingerprints (the hash codes of file chunks) according to their value ranges, and then build a B-tree for each class of fingerprints. Compared with a single B-tree containing all fingerprints, the B-tree for each fingerprint class is shorter, so fewer random disk accesses are needed per lookup, which improves file read performance (a minimal sketch of this classified index follows the abstract). Further, we analyze the trade-off between SIAR's performance improvement and its memory overhead, which makes the proposed scheme adaptable to different applications. Finally, we conduct extensive experiments that confirm the efficiency and efficacy of SIAR.

(2) We design and implement a new deduplication system consisting of a client and a server. The client classifies files by type and applies an adaptive chunking algorithm to each type, reducing computational overhead while maintaining the deduplication rate (see the chunking sketch below). The client first sends each chunk's fingerprint to the server to detect whether the chunk is new, and transfers the chunk data only if it is; only in this way is bandwidth saved (see the client-server sketch below). On the server side, a Bloom filter and SIAR are used together to detect duplicate chunks. Finally, the server stores data in TFS (Taobao File System), which ensures the reliability and scalability of the deduplication system.
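The following is a minimal Python sketch of the classified-index idea behind SIAR, not the thesis's implementation: fingerprints are routed to per-class indexes by their value range, and a back-of-the-envelope height estimate shows why smaller per-class B-trees need fewer random disk accesses. The class count, B-tree fanout, and SHA-1 fingerprinting are assumptions, and plain dicts stand in for the on-disk B-trees.

```python
import hashlib
import math

NUM_CLASSES = 256      # partition fingerprints by their first byte (an assumption)
BTREE_FANOUT = 128     # assumed branching factor of each on-disk B-tree

def fingerprint(chunk: bytes) -> bytes:
    """Chunk fingerprint; SHA-1 is a common dedup choice, assumed here."""
    return hashlib.sha1(chunk).digest()

def class_of(fp: bytes) -> int:
    """Route a fingerprint to a class by its value range (here: its first byte)."""
    return fp[0]

class SIARIndex:
    """One index per fingerprint class; dicts stand in for on-disk B-trees."""
    def __init__(self):
        self.trees = [dict() for _ in range(NUM_CLASSES)]

    def insert(self, fp: bytes, location: int) -> None:
        self.trees[class_of(fp)][fp] = location

    def lookup(self, fp: bytes):
        return self.trees[class_of(fp)].get(fp)

def btree_height(n_keys: int) -> int:
    """Approximate height of a B-tree holding n_keys with the assumed fanout."""
    return max(1, math.ceil(math.log(max(n_keys, 2), BTREE_FANOUT)))

# Why classification helps: with 10**9 fingerprints, one global B-tree needs
# ceil(log_128(1e9)) = 5 levels, while each of the 256 per-class trees holds
# about 4e6 keys and needs only 4, i.e. one fewer random disk access per lookup.
print(btree_height(10**9), btree_height(10**9 // NUM_CLASSES))  # -> 5 4

idx = SIARIndex()
fp = fingerprint(b"example chunk")
idx.insert(fp, 0)
assert idx.lookup(fp) == 0
```

The memory trade-off the abstract mentions shows up here as the per-class root and upper-level nodes that must stay cached: more classes mean shallower trees but more resident index state.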
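The abstract does not specify which chunking algorithm each file type receives, so the routing table below is hypothetical. The sketch contrasts the two standard options such a client could select between: cheap fixed-size chunking for already-compressed types, and content-defined chunking (with a simple rolling sum standing in for Rabin fingerprints) for editable types, where it keeps edits from shifting every later chunk boundary.

```python
def fixed_size_chunks(data: bytes, size: int = 8192):
    """Fixed-size chunking: cheapest option, suits already-compressed types."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, mask: int = (1 << 12) - 1,
                           min_len: int = 2048, max_len: int = 16384):
    """Content-defined chunking: cut where a rolling hash matches a pattern,
    so an insertion only disturbs boundaries near the edit."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_len and (rolling & mask) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Hypothetical type-to-algorithm mapping; the thesis's actual mapping is not given.
CHUNKERS = {".txt": content_defined_chunks, ".jpg": fixed_size_chunks}

def chunk_file(name: str, data: bytes):
    """Pick a chunker by file extension, defaulting to content-defined chunking."""
    for ext, chunker in CHUNKERS.items():
        if name.endswith(ext):
            return chunker(data)
    return content_defined_chunks(data)
```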
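Below is a minimal sketch, under assumed parameters, of the fingerprint-first exchange and the server-side two-stage duplicate check (Bloom filter first, then the full index) described above. The in-memory dict stands in for both SIAR and TFS storage; the filter size, hash count, and hash functions are illustrative choices, not the thesis's.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter (bit array + k derived hash positions)."""
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

class Server:
    """Server side: the Bloom filter screens out definitely-new chunks cheaply;
    the full fingerprint index (standing in for SIAR) confirms real duplicates."""
    def __init__(self):
        self.filter = BloomFilter()
        self.index = {}  # fingerprint -> chunk data (TFS in the real system)

    def has_chunk(self, fp: bytes) -> bool:
        # A negative Bloom answer is definitive; a positive one must be verified.
        return self.filter.might_contain(fp) and fp in self.index

    def store_chunk(self, fp: bytes, data: bytes) -> None:
        self.filter.add(fp)
        self.index[fp] = data

def client_upload(server: Server, chunks) -> int:
    """Send fingerprints first; transfer chunk data only for chunks the server lacks."""
    sent = 0
    for data in chunks:
        fp = hashlib.sha1(data).digest()
        if not server.has_chunk(fp):      # fingerprint-only round trip
            server.store_chunk(fp, data)  # data transferred only for new chunks
            sent += 1
    return sent

srv = Server()
print(client_upload(srv, [b"alpha", b"beta", b"alpha"]))  # -> 2: duplicate skipped
```

Sending fingerprints before data is what saves bandwidth: a duplicate chunk costs only a fingerprint-sized message instead of a full chunk transfer.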
Keywords/Search Tags: Big Data, Data Storage, Deduplication, Chunking Index