Research On Deduplication Technology Based On Hadoop Distributed Platforms

Posted on:2018-02-27

Degree:Master

Type:Thesis

Country:China

Candidate:R Tao

Full Text:PDF

GTID:2428330515455675

Subject:Computer technology

Abstract/Summary:

With the development of information technology, enterprise data volumes keep growing. Ma Yun, the founder of Alibaba, once said that "human beings are moving from the IT era to the DT era", which underlines the importance of data. The growth of data has also given rise to large-scale distributed computing frameworks such as Hadoop and Spark, which make massive data processing tractable. As data grows exponentially, reducing storage costs and data center energy consumption is becoming increasingly important for enterprises. However, existing distributed platforms such as Hadoop focus only on extending storage capacity and do not consider optimizing storage space. Studies have shown that massive data storage in Hadoop produces a large amount of duplicate data, with a data redundancy of 70% to 80%. To address this high redundancy, we introduce deduplication technology into Hadoop and apply duplicate detection to the large amount of duplicate data generated by system archiving, ensuring that each piece of data is stored only once and reducing data redundancy. Combining deduplication with Hadoop is significant because it retains scalability while guaranteeing the uniqueness of stored data and meeting the growing demand for data storage.

With these issues in mind, this thesis first studies the state of the art of data processing in Hadoop and of deduplication technology. Aiming to reduce the large amount of duplicate data in Hadoop, it proposes a Hadoop-based deduplication architecture. The main contribution is a new fast file aggregation scheme, named PHAF, for Hadoop input files. In addition, Keccak, the winning algorithm of the SHA-3 competition, is implemented as the fingerprint algorithm for duplicate data detection, replacing traditional algorithms such as MD5, SHA-1 and SHA-2. The experimental results show that our approach significantly outperforms the traditional secure fingerprint algorithm SHA-224. Finally, we verify the effectiveness of the proposed approach by applying it to a MOOC site project in which Hadoop is used to store log and image data. Experiments with various data block sizes have been carried out and the results are analyzed.
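
The fingerprint-based duplicate detection described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it runs in memory on a single machine, without Hadoop or the PHAF aggregation step, and it assumes fixed-size blocks. The function names (split_blocks, fingerprint, deduplicate, restore) are hypothetical. It uses Python's hashlib.sha3_256, the standardized SHA-3 function derived from Keccak; the thesis's exact Keccak variant and digest length are not stated in the abstract.

```python
import hashlib


def split_blocks(data: bytes, block_size: int):
    """Yield consecutive fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def fingerprint(block: bytes) -> str:
    """Return the SHA3-256 digest of a block as a hex string.

    Assumption: SHA3-256 stands in for the Keccak fingerprint
    mentioned in the abstract.
    """
    return hashlib.sha3_256(block).hexdigest()


def deduplicate(data: bytes, block_size: int):
    """Store each distinct block once and record the order of fingerprints."""
    store = {}   # fingerprint -> block content (unique blocks only)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for block in split_blocks(data, block_size):
        fp = fingerprint(block)
        if fp not in store:
            store[fp] = block
        recipe.append(fp)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original bytes from the block store and the recipe."""
    return b"".join(store[fp] for fp in recipe)


if __name__ == "__main__":
    # Toy input with obvious repetition (illustrative data only).
    sample = b"ABCD" * 6 + b"WXYZ" * 2
    for size in (4, 8, 16):
        store, recipe = deduplicate(sample, size)
        assert restore(store, recipe) == sample
        stored = sum(len(b) for b in store.values())
        print(f"block_size={size}: {len(recipe)} blocks, "
              f"{len(store)} unique, {stored}/{len(sample)} bytes stored")
```

Running the script with several block sizes shows why the block size is worth varying in experiments: smaller blocks tend to expose more duplicates but produce more fingerprints to compute and index.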

Keywords/Search Tags:

Big data, Hadoop, Deduplication

Related items

1	Research On Data Deduplication Technology Based On Hadoop
2	Research Of Data Deduplication Technology On Hadoop Distributed System
3	Research Of Data Deduplication In Data Disaster Tolerance Systems
4	Research On Data Encoding Optimization And Data Deduplication In Cloud Storage
5	Research On Duplicate Data Detection In Data Deduplication
6	Research On Key Technologies Of Data Deduplication For Backup System
7	The Research And Design Of Public Sentiment Publishing Platform Based On Hadoop
8	Study On Data Deduplication Technique For Data Backup Systems
9	HTDRDedu: The Design And Implementation Of A Distributed Backup Data Deduplication System
10	Research On NGS Data Processing Algorithm Based On Hadoop Platform