
Automated availability management in large-scale storage systems

Posted on: 2005-08-07
Degree: Ph.D.
Type: Thesis
University: University of California, San Diego
Candidate: Bhagwan, Ranjita
Full Text: PDF
GTID: 2458390008984995
Subject: Engineering
Abstract/Summary:
Availability is a storage system property that is both highly desired and yet minimally engineered. While many systems provide mechanisms to improve availability, such as redundancy and failure recovery, how best to configure these mechanisms is typically left to the system manager. Unfortunately, few individuals have the skills to properly manage the trade-offs involved, let alone the time to adapt these decisions to changing conditions. Instead, most systems are configured statically and with only a cursory understanding of how the configuration will impact overall performance or availability. While this issue can be problematic even for individual storage arrays, it becomes increasingly important as systems are distributed, and absolutely critical for wide-area peer-to-peer storage infrastructures.

This thesis addresses the problem of providing availability management in peer-to-peer storage systems by first defining a framework for quantifying and measuring availability that takes into account the time-varying, complex nature of host availability in peer-to-peer environments. The development of this model is primarily based on the findings of a measurement study of a widely deployed peer-to-peer file sharing network called Overnet, which gives insight into the complex behavior patterns of peer-to-peer hosts.

The thesis then provides an analytical model that investigates the relationship between availability and redundancy. In particular, it addresses the question of how much redundancy is required to ensure a specified level of file availability given a specification of host availability in the underlying system. We consider two well-known redundancy mechanisms, replication and erasure coding, in this analysis.

Finally, this thesis describes the design and implementation of the TotalRecall storage system, which uses automated availability management to maintain required levels of file availability. In particular, the TotalRecall system provides file availability as a first-class property by allowing a user of the system to specify the level of file availability that he or she desires. It then automatically measures and estimates the availability of its constituent host components, predicts their future availability based on past behavior, calculates the appropriate redundancy mechanisms and repair policies, and delivers user-specified availability while maximizing storage and network efficiency.
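
The redundancy question raised in the abstract (how much redundancy is needed to reach a target file availability) can be illustrated with a small sketch. This is not the thesis's model: it assumes, for illustration only, that every host is independently available with the same fixed probability p, which ignores the time-varying behavior the abstract emphasizes. All function names are hypothetical. Under that assumption, a file with k replicas is available with probability 1 - (1 - p)^k, and a file erasure-coded into n fragments, any m of which suffice, is available when at least m of the n hosts are up.

```python
from math import comb


def _check(p: float, target: float) -> None:
    # Guard against inputs for which the searches below would never terminate.
    if not 0.0 < p <= 1.0:
        raise ValueError("host availability p must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target availability must be in (0, 1)")


def replication_availability(p: float, k: int) -> float:
    """Probability that at least one of k independent replicas is reachable."""
    return 1.0 - (1.0 - p) ** k


def erasure_availability(p: float, m: int, n: int) -> float:
    """Probability that at least m of n independent fragments are reachable."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(m, n + 1))


def min_replicas(p: float, target: float) -> int:
    """Smallest replica count k whose availability meets the target."""
    _check(p, target)
    k = 1
    while replication_availability(p, k) < target:
        k += 1
    return k


def min_fragments(p: float, m: int, target: float) -> int:
    """Smallest total fragment count n (n >= m) that meets the target."""
    _check(p, target)
    n = m
    while erasure_availability(p, m, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    p, target = 0.5, 0.99  # assumed example values, not taken from the thesis
    k = min_replicas(p, target)
    n = min_fragments(p, 8, target)
    print(f"replication: {k} replicas, storage overhead {k}x")
    print(f"erasure coding (m=8): {n} fragments, storage overhead {n / 8:.2f}x")
```

Comparing the two overhead figures for the same p and target shows the trade-off the analysis is concerned with: replication stores k full copies, while erasure coding stores n/m times the file size.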
Keywords/Search Tags: Availability, Storage, System, Redundancy