
Enhancing availability in large scale storage systems and services: Architectures and techniques

Posted on: 2010-06-18
Degree: Ph.D.
Type: Dissertation
University: Georgia Institute of Technology
Candidate: Seshadri, Sangeetha
GTID: 1448390002478383
Subject: Computer Science
Abstract/Summary:
Enterprises today are dealing with extremely large amounts of digital information that continues to grow at an astonishing rate. Online business models, regulatory compliance and business intelligence requirements have not only required enterprises to retain large amounts of data for significant lengths of time but have also increased their reliance on anytime, anywhere access to this information. Consequently, the storage systems that serve as repositories for these huge volumes of critical data are the foundations of today's data centers. Unavailability of these systems results in losses amounting to millions of dollars per hour and could bring organizations to a grinding halt. At the same time, storage software (firmware, middleware) and systems are becoming much more complex, and existing failure recovery mechanisms are insufficient to handle the scale of these systems while meeting high availability and service quality expectations. In addition, the concurrent development and quality assurance processes, the large number of possible test scenarios and the large scale of these systems and services imply that failures will be the norm rather than the exception. Therefore, achieving high availability and reliability in storage systems remains a major concern and an open research challenge.

Most existing work in the domain of storage system availability addresses failures of the storage media (such as disks) and recoverability from these failures. However, failures at the firmware and middleware layers remain largely unaddressed. Achieving high availability in these layers poses unique challenges. At the firmware layer, fine-grained recovery is an effective approach to reducing recovery time. However, complex recovery semantics, dynamic interactions, recovery dependencies between large volumes of concurrent tasks and legacy architectures pose serious challenges. At the middleware layer, a widely recognized open problem is how to provide fault isolation and improve system availability without disrupting the system's functionality or limiting its scalability. Over the past few years, storage clusters consisting of thousands of commodity machines have become common. They are built specifically to serve large-scale, distributed, data-intensive applications in which decentralization, high availability, and autonomy are key design principles, as exemplified by Amazon S3 (Simple Storage Service) [4], Google File System [69] and IBM System S [34, 76]. Another class of functionality-rich, dedicated storage middleware also offers storage management and resource virtualization capabilities. While the scale of these systems results in new challenges [3], the nature of the applications presents new opportunities. We can use application semantics, failure characteristics, access patterns and consistency models to define novel application-specific availability-enhancing techniques at the middleware layer that go beyond traditional techniques such as replication [46] and process-pairs [72, 48].

This dissertation research addresses these challenges in depth across different storage architectures. We make the following contributions. First, we develop a recovery-conscious framework for multi-core architectures and a suite of techniques for performing efficient fine-grained recovery (micro-recovery) in storage controller firmware that can be retrofitted into legacy code. The framework includes a task-level recovery mechanism; the Log(Lock) architecture, which allows system state restoration during micro-recovery; and recovery-conscious scheduling algorithms designed to reduce the ripple effect of failure and improve recovery efficiency and system availability. Our framework also provides guidelines that help system developers map system tasks effectively to critical framework parameters, with the aim of improving availability by serializing dependent tasks and enhancing recovery efficiency while sustaining high performance and system throughput.

Our second technical contribution addresses storage middleware availability. We first develop the notion of hierarchical middleware architectures by organizing critical cluster management services into a hierarchical overlay network, which separates persistent application state from global system control state. We demonstrate that by trading some symmetry for better fault isolation, hierarchical storage middleware architectures can significantly improve the availability and reliability of enterprise-scale storage systems. In addition, we develop the notion of operator reuse and a suite of reuse techniques to improve data availability. The key idea of operator reuse is to utilize system resources efficiently by exploiting reuse opportunities in both the operators and the persistent state of computing nodes. We demonstrate our design through STREAMREUSE, a reuse-conscious store-forward network of storage nodes that offers distributed stream query processing services. By 'reuse-conscious', we mean that the system is able to modify operators and migrate services at runtime to maximize reuse opportunities. Our analytical and experimental results show that our storage middleware solutions are efficient and effective in enhancing the data availability and system availability of large-scale storage systems.
Keywords/Search Tags: Storage, Large, Availability, System, Enhancing, Architectures, Middleware, Services