Research On Deduplication Technology Based On Hadoop Distributed Platforms

Posted on: 2018-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: R Tao
Full Text: PDF
GTID: 2428330515455675
Subject: Computer technology
Abstract/Summary:
With the development of information technology, enterprise data volumes keep growing. Ma Yun (Jack Ma), the founder of Alibaba, once remarked that "human beings are moving from the IT era to the DT era", which underlines the importance of data. The growth of data has also given rise to large-scale distributed computing frameworks such as Hadoop and Spark, which make massive data processing tractable. Because data is growing exponentially, reducing storage costs and data center energy consumption is becoming increasingly important for enterprises. However, existing distributed platforms such as Hadoop focus only on extending storage capacity and do not consider optimizing storage space. Studies have shown that massive data storage in Hadoop produces a large amount of duplicate data, with data redundancy of 70% to 80%. To address this high redundancy, we introduce deduplication technology into Hadoop and apply duplicate detection to the large amount of duplicate data generated by system archiving, so that each piece of data is stored only once and redundancy is reduced. Combining deduplication with Hadoop is significant because it retains scalability while guaranteeing the uniqueness of stored data and meeting the growing demand for data storage.

With these issues in mind, this thesis first studies the state of the art of data processing in Hadoop and of deduplication technology. Aiming to reduce the large amount of duplicate data in Hadoop, it proposes a Hadoop-based deduplication architecture. The main contribution of this thesis is a new fast file aggregation scheme, named PHAF, for Hadoop input files. In addition, Keccak, the winning algorithm of the SHA-3 competition, is implemented as the fingerprint algorithm for duplicate data detection, replacing the traditional MD5, SHA-1, SHA-2, etc. The experimental results show that our approach significantly outperforms the traditional secure fingerprint algorithm SHA-224. Finally, we verify the effectiveness of the proposed approach by applying it to a MOOC site project, where Hadoop is used to store log and picture data. Experiments with various data block sizes have been carried out and the experimental results are analyzed.
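
The abstract does not give implementation details, so the following is only a minimal sketch of fingerprint-based duplicate detection under stated assumptions, not the thesis's actual method. It assumes fixed-size blocks, an in-memory dictionary standing in for a fingerprint index, and Python's `hashlib.sha3_256` as the fingerprint function. Note that `hashlib` implements the standardized SHA-3 (FIPS 202), whose padding differs from the original Keccak submission, and that the digest length used in the thesis is not stated. The function names `deduplicate` and `restore` are hypothetical, and the PHAF aggregation scheme is not modeled here.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Return a SHA3-256 hex digest used as the block's fingerprint (assumed digest size)."""
    return hashlib.sha3_256(block).hexdigest()


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each distinct block.

    Returns (store, recipe): store maps fingerprint -> block bytes,
    recipe is the ordered list of fingerprints needed to rebuild the data.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fp = fingerprint(block)
        # Only the first occurrence of a fingerprint is stored.
        if fp not in store:
            store[fp] = block
        recipe.append(fp)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the block store and the recipe."""
    return b"".join(store[fp] for fp in recipe)


if __name__ == "__main__":
    # Three identical blocks followed by one different block.
    sample = b"A" * 4096 * 3 + b"B" * 4096
    store, recipe = deduplicate(sample, block_size=4096)
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
    assert restore(store, recipe) == sample
```

Varying `block_size` in a sketch like this changes how many duplicate blocks are detected and how large the fingerprint index grows, which is the kind of trade-off the block-size experiments mentioned in the abstract would examine.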
Keywords/Search Tags: Big data, Hadoop, Deduplication