
Research On Optimization Of Data Redundancy Strategy Based On HDFS

Posted on: 2015-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Fu
GTID: 2268330428998734
Subject: Communication and Information System
Abstract/Summary:
The development of the Internet and the growth of its applications have caused an explosion of Internet data. Traditional data storage and processing technologies can no longer meet the demand created by this growing volume of data. In recent years, cloud computing has emerged with the advantages of massive data storage and processing capacity, high scalability, and high reliability, and using it for massive data storage and processing has become an inevitable trend. To improve fault tolerance and data availability, redundancy mechanisms have been introduced into the distributed file systems used in cloud computing, but they also bring new challenges for replication management. The Hadoop Distributed File System (HDFS) uses a full backup redundancy mechanism for fault tolerance and, when the backups are distributed across nodes in different locations, accesses the nearest one to reduce access latency. However, this approach wastes storage space and has poor recoverability. For this reason, some researchers have proposed using erasure codes to improve data recoverability in cloud storage systems, but erasure code decoding consumes more system resources and increases user access latency. To combine the advantages of both, the redundancy scheme REPERA, which combines erasure codes with full backups, was introduced, but it gives no method for calculating the minimum number of replicas or for selecting the best locations in which to place them.

To address the shortcomings of the existing HDFS redundancy strategy and of the improved schemes analyzed, this paper proposes an adaptive data redundancy strategy, RIRS, which combines full backup with an improved RS erasure code. The strategy offsets the defects of both approaches, integrating the low latency of full backup with the high reliability of erasure codes while greatly reducing storage space. It also exposes the replica number and the erasure coding redundancy as configuration parameters, which users can set according to their needs to tune the system toward its optimum state. In addition, experimental analysis shows that the erasure coding algorithm used in the strategy is well suited to HDFS: it has a high error-correction capability and relatively low encoding latency, so it improves system reliability while limiting the added delay.

To address the lack of replication management in RIRS, we further propose a dynamic replication management optimization model, DRMO. DRMO obtains the minimum number of replicas that satisfies the document availability requirements and adjusts the number of replicas dynamically, for a low-cost and efficient storage service. In addition, taking into account the capacity and blocking probability of each node, the model provides a balanced replica placement strategy that places copies on the data node with the smallest blocking probability, in order to reduce access latency and achieve load balancing.

Finally, after a detailed analysis of the HDFS source code related to replication management, we modify it to implement the optimized data redundancy strategy on HDFS and test its function and performance on a self-built Hadoop cluster. The functional test results show that the system realizes the intended functions, including encoding and decoding, setting the number of replicas, location selection, and dynamic replication management. The performance test results show that the erasure code we chose performs best and that DRMO meets the data validation requirements while saving storage space. Although the read and write performance results for DRMO are not entirely consistent with the theoretical results, the paper gives a detailed analysis of the causes.
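The abstract does not give the formulas behind DRMO, so the following is only an illustrative sketch of the three quantities it discusses: a minimum replica count for an availability target, the storage overhead of full backup versus an RS(k, m) erasure code, and placement on the node with the smallest blocking probability. The independent-failure model, the function names, and the node tuple layout are all assumptions made here for illustration and are not taken from the thesis.

```python
def min_replicas(p_fail, target, max_replicas=32):
    """Smallest r such that 1 - p_fail**r >= target.

    Assumes (hypothetically) that each replica is unavailable
    independently with probability p_fail.
    """
    if not (0.0 < p_fail < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_fail and target must lie strictly between 0 and 1")
    for r in range(1, max_replicas + 1):
        # The file is unavailable only if all r replicas are unavailable.
        if p_fail ** r <= 1.0 - target:
            return r
    raise ValueError("target not reachable within max_replicas")


def replication_overhead(replicas):
    """Bytes stored per byte of user data under full backup."""
    return float(replicas)


def rs_overhead(k, m):
    """Bytes stored per byte of user data for k data + m parity blocks."""
    return (k + m) / k


def pick_node(nodes, block_size):
    """Return the name of the node with the smallest blocking probability
    among nodes that still have room for one block.

    nodes maps a node name to (free_bytes, blocking_probability);
    this layout is a hypothetical stand-in for real cluster metrics.
    """
    eligible = {name: info for name, info in nodes.items()
                if info[0] >= block_size}
    if not eligible:
        raise ValueError("no node has enough free capacity")
    return min(eligible, key=lambda name: eligible[name][1])
```

Under these assumptions, `min_replicas(0.05, 0.9999)` asks how many copies are needed when each copy is unreachable 5% of the time, and comparing `replication_overhead(3)` with `rs_overhead(6, 3)` shows why mixing a few full replicas with erasure-coded blocks can save space relative to full backup alone.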
Keywords/Search Tags: Cloud Computing, HDFS, erasure codes, redundancy strategy, full backup