
Research On Data Deduplication Technology Based On Hadoop

Posted on: 2021-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: L K Zhou
Full Text: PDF
GTID: 2428330623467811
Subject: Computer Science and Technology
Abstract/Summary:
When Hadoop processes certain data, redundant data reduces the storage efficiency of the system and wastes storage resources. Data deduplication technology can effectively identify duplicate files or data blocks in the system, save system storage space, and improve the effective utilization of system resources. Hadoop is the mainstream big data development platform of the current era; if data deduplication technology can be applied to Hadoop, it can effectively promote the development of big data technology.

At present, deduplication designs on Hadoop focus too heavily on the deduplication technique itself and do not fit Hadoop's own characteristics, so they are not well suited to Hadoop. Existing designs have the following disadvantages: first, the index file design doubles the number of files in the system, which increases the NameNode's memory usage and reduces system efficiency; second, such systems are not compatible with the Hadoop abstract file system, since they focus on file downloading rather than random file reading; third, the client-server architecture makes the server the system bottleneck and is unsuitable for large-scale data storage.

Based on research into Hadoop, this thesis proposes a new design architecture and a new index file design method to solve the above problems. The specific contents are as follows:

1) This thesis proposes a new deduplication file system architecture that adds the deduplication function to the original HDFS client. In the new client, deduplicated file data interacts directly with the DataNodes, so the server does not become the system bottleneck as in conventional designs. (A hedged sketch of block-level duplicate detection appears below.)

2) A new index file is designed. The new index file design reduces the number of files on HDFS, reduces NameNode memory usage, effectively supports random reading of system files, and improves system efficiency. (A hypothetical record layout with these properties is sketched below.)

3) Based on the above designs, a distributed file system prototype is implemented on Hadoop. The prototype can effectively delete duplicate data, reduce storage space, and remain compatible with the Hadoop abstract file system.

Finally, the prototype system is tested comprehensively, covering the deduplication rate, file read/write speed, concurrent read/write performance, file deletion, and so on. The results show that the prototype achieves a deduplication rate of 56.4% on the designated data set, and the other functions also achieve the expected results.
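As context for how duplicate blocks can be identified, the following is a minimal sketch, not the thesis's actual implementation: it assumes fixed-size 4 MB blocks and SHA-256 fingerprints, both illustrative choices the abstract does not specify, and the names `BlockFingerprinter` and `recordBlock` are hypothetical.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

/** Minimal sketch: detect duplicate fixed-size blocks by hashing their content. */
public class BlockFingerprinter {
    private static final int BLOCK_SIZE = 4 * 1024 * 1024; // 4 MB blocks (assumed, not from the thesis)

    // Maps a block's SHA-256 fingerprint to the first location where it was seen.
    private final Map<String, String> fingerprintIndex = new HashMap<>();

    /** Returns true if the block is new (kept), false if it duplicates a known block. */
    public boolean recordBlock(byte[] block, int len, String location) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(block, 0, len);
        String fingerprint = HexFormat.of().formatHex(digest.digest());
        return fingerprintIndex.putIfAbsent(fingerprint, location) == null;
    }

    /** Scans a local file block by block and reports which blocks are duplicates. */
    public void scan(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[BLOCK_SIZE];
            long offset = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                boolean isNew = recordBlock(buf, n, file + "@" + offset);
                System.out.println(file + "@" + offset + (isNew ? " kept" : " duplicate"));
                offset += n;
            }
        }
    }
}
```

In a client-side design like the one the abstract describes, only blocks whose fingerprints are new would be written to the DataNodes; duplicates would instead be recorded as references in the index file.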
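The abstract describes the new index file only at a high level (fewer files on HDFS, reduced NameNode memory, support for random reads). The sketch below shows one hypothetical fixed-size record layout that would yield those properties; the field set and the `IndexEntry` name are assumptions, not the thesis's published design.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical on-disk index record (the thesis does not publish its exact layout).
 * One fixed-size entry per logical block lets a reader serving byte offset p load
 * entry (p / blockSize) directly, which supports random reads without scanning
 * the whole index and keeps each deduplicated file to a single index file on HDFS.
 */
public class IndexEntry {
    public static final int ENTRY_SIZE = 32 + 8 + 8; // fingerprint + stored block id + length

    public final byte[] fingerprint = new byte[32]; // SHA-256 of the block content
    public long storedBlockId;                      // where the unique copy of the block lives
    public long length;                             // bytes valid in this logical block

    /** Serializes one entry into its fixed-size slot of the index file. */
    public void writeTo(ByteBuffer buf) {
        buf.put(fingerprint).putLong(storedBlockId).putLong(length);
    }

    /** Deserializes one entry from the index file. */
    public static IndexEntry readFrom(ByteBuffer buf) {
        IndexEntry e = new IndexEntry();
        buf.get(e.fingerprint);
        e.storedBlockId = buf.getLong();
        e.length = buf.getLong();
        return e;
    }

    /** Index-file byte offset of the entry covering logical byte offset p. */
    public static long entryOffset(long p, long blockSize) {
        return (p / blockSize) * ENTRY_SIZE;
    }
}
```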
Keywords/Search Tags:Deduplication, Hadoop, Index file, HDFS, Distributed file system