Font Size: a A A

Research On Erasure Codes And Data Repair Techniques For Multiple Node Failures

Posted on:2015-02-04Degree:MasterType:Thesis
Country:ChinaCandidate:X LinFull Text:PDF
GTID:2348330509960902Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In distributed storage systems, erasure codes could save more storage cost compared with replication. However, the high network traffic communicated during the repair degrades the system performance and limits the adoption of the erasure codes in distributed storage systems. The repair of erasure codes could be divided into “eager repair” and “lazy repair”, where the repair is triggered as the failure is detected in “eager repair”, and the repair will not be triggered until the failure number reaches a given threshold. Compared with “eager repair”, “lazy repair” could reduce the network traffic. Thus, “lazy repair” becomes the main repair style, where the multiple failures become the normal but not the exception. Existing erasure codes or repair schemes mainly focus on the repair for single node failure, which will incur high network traffic and result in low efficiency when repairing multiple failures. This thesis studies the erasure codes and the repair schemes to reduce the network traffic and improve the repair efficiency. The main contributions obtained are as follows.To address the problem of high network traffic when repairing multiple failures of existing erasure codes, this thesis proposes a group repairable code named GRC. To reduce the repair network traffic, GRC groups the blocks within a stripe and generates the local parity blocks within a group. GRC combines the local parity blocks with the same ID within each group into a global parity block by adjusting the generating matrix. The combined global parity blocks and the existing global parity blocks maintain the redundancy. This thesis also proposes GRC's multiple choose decode algorithm named GMCD to reduce the network traffic of decoding and improve the performance. Extensive experimental results demonstrate that, compared with existing RS code, GRC can reduce the repair network traffic by more than 50%. Compared with the BPC, GRC can reduce the repair network traffic by more than 15%-25%.To solve the problem of low repair efficiency of existing repair schemes, this thesis proposes a group data exchange recovery technique named GDER. Different from the traditional centralized repair technique where one newcomer needs to read all the required blocks to regenerate the lost blocks, GDER divides the required blocks into two groups and the blocks within each group are transmitted to the newcomer in each group. Each newcomer regenerates the lost blocks by combing the data received from the providers within a group and the data from the other newcomer. By pipelining the data transmission and decoding in two newcomer, GDER improves the repair efficiency significantly compared with the traditional repair technique. Meanwhile, this thesis proposes a minimum data transfer cost node-selecting algorithm named MDTNS to optimize the repair performance. Extensive experimental results reveal that, compared with the traditional simultaneous repair technique, GDER improve the repair speed by more than 70% with low network traffic cost.Based on the theoretical research results mentioned above, a data repair prototype system based on group repair codes named Group CR is designed and implemented. Group CR adopts the modular design principle, where the encoding and repairing modular are independent with each other to improve the flexibility. Group CR not only supports the GRC and GDER, but can be easily extended to support other codes and repair techniques. The results of thorough experiments in Group CR show that, GRC can reduce the repair network traffic by 50%-60% and 15%-25% compared with RS code and BPC respectively with star structure. GDER can improve the repair speed by 70%-80% compared with the traditional simultaneous repair technique with RS and GRC, which further validates the theoretical analysis and simulation results.
Keywords/Search Tags:Distributed Storage System, Erasure Codes, Lazy Repair, Multiple Failure Repair
PDF Full Text Request
Related items