Research On Highly Failure Tolerated Technology And Fast Failure Recovery In Disk Array

Posted on: 2011-06-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S G Wan
GTID: 1118360305492372
Subject: Computer system architecture
Abstract/Summary:
Information technology now touches many aspects of everyday life, and society depends on information services, applications, and massive data. This drives rapid growth in the volume and variety of data and improvements in the organization, scale, and performance of computers. As a result, the reliability and availability of storage systems have become increasingly important and receive growing attention. The availability of a Redundant Array of Inexpensive Disks (RAID) system with high capacity, performance, and reliability is determined by the disk model, system scale, data organization, and recoverability. Based on a study of the data organization of current RAID systems that tolerate double disk failures, we aim to reduce the risk of data loss and improve quality of service. To that end, we design novel data organization methods that optimize erasure codes and online recovery, shortening the mean time to repair (MTTR) and increasing the availability of RAID systems.

By introducing the concept of a strip-set, we present Code-M, a novel non-MDS coding scheme that tolerates up to two disk failures, satisfies the RAID-6 property, and is applicable to large-scale disk arrays, where it supports fast recovery from up to two disk failures. Code-M is a lowest-density code. By distributing parity evenly across the stripe, it achieves optimal short-write complexity. Code-M also scales well: it supports flexible numbers of disks through different constructions of the strip-set.
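The parity-chain idea behind XOR-based array codes can be illustrated with a minimal sketch. This is not Code-M's construction, whose strip-set layout is not given in this abstract; it assumes a single hypothetical parity chain over small byte blocks, and the helper names `xor_blocks` and `recover_missing` are illustrative only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column (one byte position at a time).
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_missing(surviving_blocks):
    """Rebuild the one missing element of a parity chain.

    The XOR of all elements in a chain (data plus parity) is zero,
    so the missing element equals the XOR of every surviving element.
    """
    return xor_blocks(surviving_blocks)


# Hypothetical chain: three data elements and one parity element.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# Simulate losing data[1] and rebuild it from the survivors plus parity.
rebuilt = recover_missing([data[0], data[2], parity])
```

Codes that tolerate two failures generally protect data elements with more than one type of parity chain (row and diagonal chains in RDP, for example), so a lost element can often be rebuilt in more than one way.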
Our theoretical analysis shows that Code-M can speed up single-disk failure recovery by a factor of up to 2.69 and reduce double-disk failure recovery time by 40.9%, at the cost of only 16.7% extra space, compared with an RDP-based RAID with the same number of disks (24), the same amount of user data in the array, and the same recovery bandwidth.

We analyze the parity generation methods of XOR-based highly fault-tolerant array codes. We then introduce a novel online recovery optimization for single-disk failure, the Cross Recovery Scheme (CRS), which reconstructs lost data elements using different types of parity chains. CRS trades part of the recovery workload for some additional decoding computation and balances the remaining recovery workload more evenly among the surviving disks. Our analysis shows that CRS is effective for most XOR-based highly fault-tolerant array codes. It can reduce the part of the user workload associated with decoding in addition to reducing part of the recovery workload. Quantitative evaluation shows that applying CRS to an RDP-based disk array reduces the total recovery workload by 20.84% and the total recovery read workload by 22.92% with 12 disks, and reduces the maximum recovery read workload on a single disk by 37.5% with 4 disks.

We propose a novel cache replacement algorithm based on dual working stacks, Adaptive Dual-stack LRU (AD-LRU). It aims to improve the cache hit ratio, which reduces the impact of user workloads on the recovery workload and thereby optimizes online recovery. Building on the concept of stack efficiency and an analysis of the relationship between adaptive stack size and stack efficiency, AD-LRU uses two working stacks and a single history stack. Instead of one LRU stack, it uses two: one to capture accesses to pages with low recency and the other to capture accesses to pages with high recency.
The idea is to adaptively adjust the sizes of the history stack and of the recency and frequency stacks so that overall buffer cache efficiency, measured by hit ratio, improves. Simulation results show that AD-LRU achieves a higher hit ratio than existing popular algorithms in most cases.

Through these three lines of research, which develop more comprehensive data organization methods for large-scale disk arrays that tolerate double disk failures and a new cache replacement algorithm that improves online recovery performance, this work effectively improves disk arrays that tolerate double disk failures.
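The hit-ratio metric used to compare cache replacement algorithms can be made concrete with a small simulator. The sketch below implements ordinary single-stack LRU only, as a baseline; it is not AD-LRU, whose stack-resizing rules are not given in this abstract. The function name `lru_hit_ratio` and the access trace are hypothetical.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity):
    """Replay a page-access trace through a single-stack LRU cache."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return hits / len(trace) if trace else 0.0


# Hypothetical trace: a small hot set (pages 1 and 2) interrupted by a one-off scan.
trace = [1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2]
ratio = lru_hit_ratio(trace, capacity=2)
```

In this trace the scan over pages 3, 4, and 5 evicts the hot pages from the single LRU stack, so they miss when they are accessed again. This is the sort of weakness that designs with separate recency and frequency stacks aim to address.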
Keywords/Search Tags: Storage System, Reliability, Availability, Data Organization, Erasure Code, Non-MDS code, Recovery Algorithm, Cache Replacement Algorithm