
Research On The Data Redundancy Technique For The Large-scale And Distributed Storage Systems

Posted on: 2013-06-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Huang
Full Text: PDF
GTID: 1268330392473828
Subject: Computer Science and Technology

Abstract/Summary:
Distributed Storage Systems (DSSs) connect distributed storage nodes through network communication protocols and store massive amounts of data. The aim of DSSs is to provide highly reliable, low-cost and highly efficient storage service. To overcome challenges such as weak data readability, high data-maintenance traffic, complicated data allocation and the hardness of selecting service nodes, this dissertation studies reading redundant data, maintaining redundant data, allocating redundant data and selecting service nodes, and obtains the following contributions:

To overcome the challenge of weak data readability under erasure codes, based on an analysis of the expected cost of random and ordered access, this dissertation proposes Exact Hierarchical Codes (EHC for short), which reduce the download traffic, the complexity of decoding and the waiting delays of the decoding process. The basic idea of EHC is to reduce the download traffic and decoding complexity by reconstructing from several small groups in the low layers, and to reduce the waiting delays before decoding by mapping the code construction onto a tree structure and applying a postorder-traversal-based algorithm (PTBA for short) that selects original and encoded blocks from the low layers as much as possible. Compared with traditional erasure coding schemes, EHC efficiently exploits a construction with multiple layers, multiple groups and a low repair degree, and applies PTBA with its preference for low-layer nodes; EHC therefore reduces the download time and the computational cost of decoding, and speeds up the decoding operation.

To overcome the challenge of high data-maintenance traffic, based on an analysis of the causes and characteristics of data-maintenance traffic, this dissertation proposes ER-Hierarchical Codes (ERHC for short), which efficiently reduce the maintenance traffic.
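To make EHC's tree-based selection concrete, here is a minimal sketch of a PTBA-style postorder selection over a hypothetical two-group, two-layer code tree. The node names, tree shape and stopping rule are illustrative assumptions, not the dissertation's actual construction.

```python
# Hypothetical sketch of PTBA-style block selection: a postorder traversal
# of a hierarchical code tree that prefers low-layer blocks over the
# top-layer parity. Node names and tree shape are illustrative only.

class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def postorder_select(root, needed):
    """Collect up to `needed` block names in postorder, so that
    low-layer blocks are chosen before high-layer parity blocks."""
    selected = []
    def visit(node):
        if len(selected) >= needed:
            return
        for child in node.children:
            visit(child)
        if len(selected) < needed:
            selected.append(node.name)
    visit(root)
    return selected

# Two small groups of original blocks, each with a local parity,
# plus one global parity block at the top layer.
g1 = Node("p1", [Node("o1"), Node("o2")])
g2 = Node("p2", [Node("o3"), Node("o4")])
root = Node("P", [g1, g2])

print(postorder_select(root, 4))  # ['o1', 'o2', 'p1', 'o3']
```

Because the traversal visits children before parents, the top-layer parity `P` is only selected if the low layers cannot supply enough blocks, which mirrors EHC's preference for cheap, low-layer reconstruction.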
The basic idea of ERHC is to apply the technique of Regenerating Codes to the complex construction of Hierarchical Codes: ERHC directly divides the blocks of Hierarchical Codes into fragments, maintains the original multi-layer, multi-group construction of Hierarchical Codes, preserves their high reliability, reduces the number of nodes that participate in the repair process, and also reduces the amount of data transferred during repair. Compared with previous erasure coding schemes, ERHC efficiently exploits the small-group construction of Hierarchical Codes and the information-combining techniques of Regenerating Codes. By building a code construction with multiple layers, multiple groups and multiple fragments, ERHC efficiently reduces the data-maintenance traffic under the constraints of high reliability and low storage overhead, performs stably across different repair modes, which benefits applications, and also reduces the computational complexity of repair.

To overcome the challenge of the high complexity of data allocation, based on an analysis of that complexity, this dissertation proposes an optimal storage allocation (OSA for short) scheme based on generating functions, which optimizes the storage allocation so that the data redundancy is minimized under a given high-reliability constraint. The basic idea is to map the relationship between data reliability and the reliability of the corresponding storage nodes onto the relationship between a generating function and its multiplied factors, and to derive, through detailed proofs about the generating function, the parameter relations for optimal storage allocation, rules for reducing the computational cost, and conditions for terminating the search.
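The generating-function mapping can be illustrated with a small sketch, assuming independent node failures and a threshold code that tolerates any n − k node losses. The polynomial-multiplication routine below is an illustrative reconstruction of the general idea, not the dissertation's OSA algorithm.

```python
# Hypothetical illustration of the generating-function view of reliability:
# for n independent nodes with survival probabilities p_i, the polynomial
# prod_i ((1 - p_i) + p_i * x) has, as the coefficient of x^j, the
# probability that exactly j nodes survive. Data stored with an (n, k)
# threshold code is readable iff at least k nodes survive.

def survivor_poly(probs):
    """Coefficients c[j] = P(exactly j of the nodes survive)."""
    coeffs = [1.0]
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * (1 - p)   # this node fails
            nxt[j + 1] += c * p     # this node survives
        coeffs = nxt
    return coeffs

def reliability(probs, k):
    """P(at least k nodes survive) = sum of the high-order coefficients."""
    return sum(survivor_poly(probs)[k:])

# Three nodes, each alive with probability 0.9; data needs any 2 of 3.
print(round(reliability([0.9, 0.9, 0.9], 2), 4))  # 0.972
```

Each storage node contributes one multiplied factor, so reliability questions reduce to reading off polynomial coefficients, which is what makes the allocation search over node sets tractable to analyze.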
Compared with traditional methods, OSA exploits the simple expression and analytical tractability of generating functions, efficiently reduces the data redundancy, shrinks the search space and simplifies the computation of data reliability.

To overcome the challenge of the hardness of selecting service nodes, based on an analysis of the status of service nodes and the access characteristics of users, this dissertation defines a node-coverage problem on a bipartite graph based on data popularity, proves that this problem is NP-complete, and proposes a mechanism for selecting service nodes (SNBS for short) based on the skewness of data access, which reduces node cost. The basic idea is to obtain the key parameters by probabilistic analysis and deduction, and to apply a parallelized greedy algorithm based on the failure probability of access, which spins down as many storage nodes as possible under the constraint of a low failure probability for users' access. Compared with traditional methods, and without moving any data, SNBS efficiently analyzes the failure probability of users' access and the impact of spinning down storage nodes, and uses the parallelized greedy algorithm to reduce the service cost, cover most data objects and adapt to different system policies and network environments.
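A minimal sketch of a greedy, coverage-based selection step in the spirit of SNBS follows. It is sequential rather than parallelized, and the data layout, popularity values and coverage target are invented for illustration; they are not the dissertation's parameters.

```python
# Hypothetical sketch of greedy service-node selection: keep powering the
# node that covers the most popularity-weighted, not-yet-covered data
# objects, and mark the rest as spin-down candidates.

def greedy_select(node_objects, popularity, target):
    """Select nodes until the kept nodes cover at least `target` of the
    total access popularity; remaining nodes are spin-down candidates."""
    covered = set()
    kept = []
    total = sum(popularity.values())
    while sum(popularity[o] for o in covered) < target * total:
        # pick the node whose objects add the most uncovered popularity
        best = max(node_objects, key=lambda n: sum(
            popularity[o] for o in node_objects[n] - covered))
        kept.append(best)
        covered |= node_objects.pop(best)
    return kept, list(node_objects)  # kept nodes, spin-down candidates

# Three nodes holding overlapping objects with skewed popularity.
nodes = {"n1": {"a", "b"}, "n2": {"b", "c"}, "n3": {"c", "d"}}
pop = {"a": 5, "b": 3, "c": 2, "d": 1}
kept, idle = greedy_select(nodes, pop, 0.9)
print(kept, idle)  # ['n1', 'n3'] ['n2']
```

With skewed popularity, a small set of nodes covers most of the access load, so `n2` can be spun down without pushing the access-failure probability above the target; this is the set-cover intuition behind trading node cost against coverage.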
Keywords/Search Tags: distributed storage, erasure code, redundancy, reliability