Cluster System Fault-tolerant Middleware Technology Research

Posted on: 2006-01-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Huang
Full Text: PDF
GTID: 1118360185995705
Subject: Computer application technology
Abstract/Summary:
Keeping systems highly available and applications reliable has long been one of the most important goals in high performance computing research. The cluster is the mainstream architecture for high performance computing because of its low cost and good scalability, and the loose coupling between nodes makes high availability easier to implement in a cluster than in a centralized system. As clusters grow larger, however, new problems arise, such as more frequent failures of system components and the limited scalability of the software architecture. These problems bring new challenges to the design and implementation of highly available system software. Cluster fault-tolerant middleware combines fault-tolerance, middleware, and cluster technologies into integrated system software for fault tolerance in cluster systems. It is a new approach to keeping cluster systems available and cluster applications reliable at low cost and with high scalability.

Based on the design and implementation of the fault-tolerant kernel for the Dawning4000A cluster operating system, this dissertation discusses in depth the key issues of cluster fault-tolerant middleware, focusing on: (1) a scalable fault-tolerant middleware framework for large-scale cluster systems; (2) adaptive and reliable fault-tolerant mechanisms for large-scale cluster systems; (3) evaluating the effect of cluster fault-tolerant middleware by modeling and analyzing system availability and application reliability. The contributions of this dissertation include:

1. Current high-availability system software cannot meet the demands for scalability and performance when the system becomes very large. To solve this problem, this dissertation proposes a new fault-tolerant middleware framework named DCFT-Kernel. DCFT-Kernel adopts partitioning and a hybrid architecture that combines master/slave and peer-to-peer designs to eliminate the scalability problems of the system, the software architecture, and the fault-tolerant mechanism. Furthermore, DCFT-Kernel consists of a group service, an event service, a configuration service, and programming APIs, which enable it to provide integrated fault-tolerant functions for error detection, error recovery, and error notification.

2. Implementing fault-tolerant services in an asynchronous distributed system requires taking the fundamental consensus problem into account. In addition, the foundation of fault-tolerant middleware is keeping the middleware itself reliable. To solve these two problems, chapter 4 proposes a group service, which integrates a group of cooperating processes to provide common fault-tolerant services. Through group membership protocol and reliable multicast...
Keywords/Search Tags:cluster, fault-tolerant, middleware framework, partition, group service, correlated failure, stochastic reward Petri net