Cluster-based storage systems with high scalability

Posted on: 2006-06-08
Degree: Ph.D
Type: Dissertation
University: The University of Nebraska - Lincoln
Candidate: Zhu, Yifeng
Full Text: PDF
GTID: 1458390005996394
Subject: Computer Science
Abstract/Summary:
In recent years, high-end computing has undergone two significant changes: (1) an increasing focus on data-intensive applications, such as data mining, computational biology, and high energy physics, and (2) a paradigm shift from tightly coupled, high-end proprietary computing systems to loosely coupled, cost-effective platforms that consist of networked commodity machines, also known as clusters. Thus, a reliable and scalable storage infrastructure in clusters becomes increasingly crucial for high-end computing. This dissertation investigates the effectiveness of utilizing the existing disks to build a cluster-based storage system and addresses the key problems that limit the scalability of such systems at four levels: the block data level, the metadata level, the file data level, and the application level.
At the block data level, this dissertation proposes a novel and simple replacement scheme, called RACE, which differentiates the locality of I/O streams by actively detecting access patterns inherently exhibited in two correlated spaces: the discrete block space of program contexts from which I/O requests are issued and the continuous block space within files to which I/O requests are addressed. RACE is shown to significantly outperform LRU and all other state-of-the-art cache management schemes studied in this dissertation in terms of hit ratios. At the metadata level, this dissertation exploits the temporal locality of metadata accesses to improve metadata access performance by designing a Hierarchical Bloom filter Array (HBA) scheme that decentralizes metadata management. Our implementation indicates that HBA with 16 metadata servers can reduce the metadata operation time of a single-metadata-server architecture by a factor of up to 43.9. A theoretical model that incorporates staleness to estimate the false rates of Bloom filters is proposed to support adaptive Bloom filter updating.
At the file data level, this dissertation proposes using redundant data to improve performance for large data accesses by dynamically scheduling I/O requests among data servers. At the application level, this work conducts a case study of a popular I/O-intensive application, parallel BLAST, and uses this application as a benchmark to evaluate the techniques proposed at the file data level.
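The metadata-level idea of keeping one Bloom filter per metadata server and querying the whole array to locate a file's metadata can be illustrated with a minimal sketch. This is a single-level simplification for illustration only, not the dissertation's HBA implementation: the class `BloomFilter`, the function `candidate_servers`, the filter size, the hash construction, and the example path are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; the sizes are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))


def candidate_servers(filters, path):
    """Return the indices of servers whose filter reports a possible hit."""
    return [i for i, bf in enumerate(filters) if path in bf]


# Hypothetical setup: four metadata servers, one filter each.
filters = [BloomFilter() for _ in range(4)]
filters[2].add("/home/a/data.txt")  # server 2 owns this file's metadata
hits = candidate_servers(filters, "/home/a/data.txt")
```

A Bloom filter never yields a false negative, so the owning server always appears among the candidates. When more than one filter reports a hit, a caller would have to contact each candidate to resolve the false positive, which is why estimating false rates matters for such a design.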
Keywords/Search Tags:Data, Cluster-based storage, Application, I/O requests, Systems