
Research On Data Deduplication Technology Based On Hadoop

Posted on: 2021-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: L K Zhou
Full Text: PDF
GTID: 2428330623467811
Subject: Computer Science and Technology
Abstract/Summary:
When Hadoop processes certain data, redundant data reduces the storage efficiency of the system and wastes storage resources. Data deduplication technology can effectively identify duplicate files or data blocks in the system, save system storage space, and improve the effective utilization of system resources. Hadoop is the mainstream big data development platform of the current era; if data deduplication technology can be applied to Hadoop, it can effectively promote the development of big data technology.

At present, deduplication designs on Hadoop focus too heavily on the deduplication technique itself and do not fit Hadoop's own characteristics, so they are not well suited to Hadoop. Existing designs have the following disadvantages: first, the index file design doubles the number of files in the system, which increases the NameNode's memory usage and reduces system efficiency; second, such systems are not compatible with the Hadoop abstract file system, since they focus on file downloading rather than random file reading; third, the client-server architecture makes the server the system bottleneck and is unsuitable for large-scale data storage.

Based on research into Hadoop, this thesis proposes a new design architecture and a new index file design method to solve the above problems. The specific contents are as follows:

1) This thesis proposes a new deduplication file system architecture that adds the deduplication function to the original HDFS client. In the new client, deduplicated file data interacts directly with the DataNodes, so the server does not become the system bottleneck as in conventional designs. (A hedged sketch of block-level duplicate detection appears below.)

2) A new index file is designed. The new index file design reduces the number of files on HDFS, reduces NameNode memory usage, effectively supports random reading of system files, and improves system efficiency. (A hypothetical record layout with these properties is sketched below.)

3) Based on the above designs, a distributed file system prototype is implemented on Hadoop. The prototype can effectively delete duplicate data, reduce storage space, and remain compatible with the Hadoop abstract file system.

Finally, the prototype system is tested comprehensively, covering the deduplication rate, file read/write speed, concurrent read/write performance, file deletion, and so on. The results show that the prototype achieves a deduplication rate of 56.4% on the designated data set, and the other functions also achieve the expected results.
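As context for how duplicate blocks can be identified, the following is a minimal sketch, not the thesis's actual implementation: it assumes fixed-size 4 MB blocks and SHA-256 fingerprints, both illustrative choices the abstract does not specify, and the names `BlockFingerprinter` and `recordBlock` are hypothetical.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

/** Minimal sketch: detect duplicate fixed-size blocks by hashing their content. */
public class BlockFingerprinter {
    private static final int BLOCK_SIZE = 4 * 1024 * 1024; // 4 MB blocks (assumed, not from the thesis)

    // Maps a block's SHA-256 fingerprint to the first location where it was seen.
    private final Map<String, String> fingerprintIndex = new HashMap<>();

    /** Returns true if the block is new (kept), false if it duplicates a known block. */
    public boolean recordBlock(byte[] block, int len, String location) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(block, 0, len);
        String fingerprint = HexFormat.of().formatHex(digest.digest());
        return fingerprintIndex.putIfAbsent(fingerprint, location) == null;
    }

    /** Scans a local file block by block and reports which blocks are duplicates. */
    public void scan(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[BLOCK_SIZE];
            long offset = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                boolean isNew = recordBlock(buf, n, file + "@" + offset);
                System.out.println(file + "@" + offset + (isNew ? " kept" : " duplicate"));
                offset += n;
            }
        }
    }
}
```

In a client-side design like the one the abstract describes, only blocks whose fingerprints are new would be written to the DataNodes; duplicates would instead be recorded as references in the index file.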
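The abstract describes the new index file only at a high level (fewer files on HDFS, reduced NameNode memory, support for random reads). The sketch below shows one hypothetical fixed-size record layout that would yield those properties; the field set and the `IndexEntry` name are assumptions, not the thesis's published design.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical on-disk index record (the thesis does not publish its exact layout).
 * One fixed-size entry per logical block lets a reader serving byte offset p load
 * entry (p / blockSize) directly, which supports random reads without scanning
 * the whole index and keeps each deduplicated file to a single index file on HDFS.
 */
public class IndexEntry {
    public static final int ENTRY_SIZE = 32 + 8 + 8; // fingerprint + stored block id + length

    public final byte[] fingerprint = new byte[32]; // SHA-256 of the block content
    public long storedBlockId;                      // where the unique copy of the block lives
    public long length;                             // bytes valid in this logical block

    /** Serializes one entry into its fixed-size slot of the index file. */
    public void writeTo(ByteBuffer buf) {
        buf.put(fingerprint).putLong(storedBlockId).putLong(length);
    }

    /** Deserializes one entry from the index file. */
    public static IndexEntry readFrom(ByteBuffer buf) {
        IndexEntry e = new IndexEntry();
        buf.get(e.fingerprint);
        e.storedBlockId = buf.getLong();
        e.length = buf.getLong();
        return e;
    }

    /** Index-file byte offset of the entry covering logical byte offset p. */
    public static long entryOffset(long p, long blockSize) {
        return (p / blockSize) * ENTRY_SIZE;
    }
}
```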
Keywords/Search Tags:Deduplication, Hadoop, Index file, HDFS, Distributed file system