Font Size: a A A

Study On Failure Recovery Mechanisms Of RAID-6Storage Systems

Posted on:2014-01-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:L P XiangFull Text:PDF
GTID:1228330395458597Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The highly dependence on information technology in modern society leads to the ever increasing demand on data reliability and availability. However, the continuing increase of system size and the usage of inexpensive but less reliable components (for economical consideration) make component failures, such as disk failure, more common. Distributed storage systems commonly use fault-tolerant technologies to protect data against failures. When disk (or storage node) failure occurs, the recovery process requires multiple reads to surviving disks, and then regenerates the lost data according to the encoding mechanism. In general, the total data that must be processed during recovery plays a crucial role in system reliability and service performance. Therefore, the key issue of fault tolerant in storage systems is how to repair a failed disk effectively to minimize the system resources consumed by recovery. This dissertation studies the failure recovery mechanisms of storage systems based on RAID-6codes. Its main works and contributions are as follows:(1) Single failure recovery of RDP code based storage systems.RDP code is one of the most important RAID-6codes, which ensures all information is recoverable when there are no more than two disk failures simultaneously. This dissertation studies the single failure recovery problem of RDP code based storage systems, and derives the theoretical lower bound of the number of disk reads needed for any single failure recovery. A new hybrid repair algorithm RDOR-RDP (Row Diagonal Optimal Recovery-RDP) which uses both row parity and diagonal parity for single failure recovery is proposed in this dissertation. RDOR-RDP achieves the theoretical lower bound of disk reads. Besides, RDOR-RDP has the load-balancing property that all surviving disks will experience the same amount of workload to recover the failed disk. The numerical analysis shows that the number of disk reads for single failure recovery of RDOR-RDP is reduced by approximately25%compared with the conventional recovery algorithm. Experiment results show that compared with the conventional recovery algorithm, the average disk read access time and the total recovery time of RDOR-RDP are reduced by15.16%~22.60%and5.72%~12.60%, respectively.(2) Single failure recovery of EVENODD code based storage systems.EVENODD code is one of the most widely studied RAID-6codes, which can tolerant double disk failures simultaneously. Based on the encoding characteristics of EVENODD code, the proposed hybrid recovery mechanism is used for the single failure recovery of EVENODD code. In this dissertation, a theoretical lower bound of the number of disk reads needed for any single failure recovery of EVENODD code is derived. Besides, during recovery, one would expect the recovery read requests are uniformly distributed among all surviving disks. Therefore, this dissertation also provides a sufficient condition that the recovery load is balanced among all surviving disks. Based on the theoretical lower bound of disk reads and the sufficient condition of load balancing, a new recovery algorithm RDOR-EVENODD (Row Diagonal Optimal Recovery-EVENODD) is proposed for the single failure recovery of EVENODD code. RDOR-EVENODD issues the minimum number of disk reads to recover a single failed disk, and also has a load balancing property on all surviving disks. The numerical analysis shows that compared with the conventional recovery algorithm of EVENODD code, the number of disk reads needed for recovery is also reduced by nearly25%. Experiment results show that RDOR-EVENODD outperforms the conventional recovery algorithm in terms of average disk read access time and total recovery time.(3) Sector faliure recovery of RAID-6codes based storage systems.In RAID-6storage systems, except for complete disks failures, there are other failure scenarios, such as scattered sector failures (some portions of a disk are corrupted), and so on. These failures could also cause data loss if they are not eventually recovered. This dissertation addresses the issue of designing efficient recovery algorithms for both sector and disk failures for RAID-6based distributed storage systems. In this dissertation, a new graph-theoretic framework is proposed to define erasure scenarios for storage systems that use RAID-6codes. Based on the graph framework, a necessary and sufficient condition is derived to determine whether an erasure symbol is recoverable in an erasure scenario. Two universal recovery algorithms Graph Shrink Recovery (GSR) and Improved Graph Shrink Recovery (IGSR) are proposed according to the necessary and sufficient conditions. GSR and IGSR can recover all the recoverable erasure symbols for any erasure scenario. Besides, by using GSR and IGSR, the recovered data is used for the later recovery which can effectively avoid the repetitive computations and achieve a better decoding performance.
Keywords/Search Tags:RAID-6codes, RDP code, EVENODD code, Data recovery, Faulttolerance, Reliability
PDF Full Text Request
Related items