Compiler assisted application-level fault tolerance in distributed systems

Posted on: 2006-10-04
Degree: Ph.D
Type: Dissertation
University: Arizona State University
Candidate: Karablieh, Feras
Full Text: PDF
GTID: 1458390005496492
Subject: Computer Science
Abstract/Summary:
Developing fault-tolerant applications is difficult and typically requires specialized hardware or software support that is neither portable nor customizable. This dissertation shows that compiler techniques can provide transparent application-level fault tolerance to applications in distributed systems. It considers both parallel applications, in which a set of communicating processes collaborate on a computation, and standalone processes that may need to be migrated and restarted on machines of different architectures. For parallel computations, fault tolerance is introduced through active replication. Parallel programs written using the Message Passing Interface (MPI) standard are transformed by a preprocessor to execute in replicated form, so that they tolerate node failures efficiently without interrupting execution. The dissertation shows that active replication outperforms checkpointing-based techniques in scalability and execution overhead for massively parallel computations. For standalone processes, heterogeneous fault tolerance is introduced through checkpointing and rollback recovery. A preprocessor transforms programs into semantically equivalent fault-tolerant ones that periodically checkpoint their state to stable storage; the saved state can be used to restart the program after a failure on a machine with the same or a different architecture. Heterogeneous fault tolerance is provided for single-threaded as well as multi-threaded programs written in the C programming language using POSIX threads.
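To make the idea of application-level checkpointing and rollback recovery concrete, the following is a minimal hand-written sketch in C. It is not the dissertation's preprocessor output, which this record does not show. The names `app_state`, `save_checkpoint`, `restore_checkpoint`, and `run_until` are hypothetical, and the assumption here is that portability across architectures is obtained by writing fixed-width integers in a fixed (big-endian) byte order rather than dumping raw memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical application state: a loop counter and a running sum. */
typedef struct {
    uint32_t iteration;
    uint32_t sum;
} app_state;

/* Write a 32-bit value in big-endian order, independent of host endianness. */
static int put_u32(FILE *f, uint32_t v) {
    unsigned char b[4] = {
        (unsigned char)(v >> 24), (unsigned char)(v >> 16),
        (unsigned char)(v >> 8),  (unsigned char)v
    };
    return fwrite(b, 1, 4, f) == 4 ? 0 : -1;
}

/* Read a 32-bit big-endian value back into host representation. */
static int get_u32(FILE *f, uint32_t *v) {
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4) return -1;
    *v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
         ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 0;
}

/* Save the state field by field; returns 0 on success, -1 on error. */
int save_checkpoint(const char *path, const app_state *s) {
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int rc = (put_u32(f, s->iteration) == 0 && put_u32(f, s->sum) == 0) ? 0 : -1;
    if (fclose(f) != 0) rc = -1;
    return rc;
}

/* Restore the state; *s is left untouched if no valid checkpoint exists. */
int restore_checkpoint(const char *path, app_state *s) {
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    app_state tmp;
    int rc = (get_u32(f, &tmp.iteration) == 0 && get_u32(f, &tmp.sum) == 0) ? 0 : -1;
    fclose(f);
    if (rc == 0) *s = tmp;
    return rc;
}

/* Sum 1..limit, resuming from a checkpoint if one exists and
 * checkpointing every `interval` iterations. Returns the running sum. */
uint32_t run_until(const char *path, uint32_t limit, uint32_t interval) {
    assert(interval > 0);
    app_state s = {0, 0};
    (void)restore_checkpoint(path, &s);   /* fresh start if this fails */
    while (s.iteration < limit) {
        s.iteration++;
        s.sum += s.iteration;
        if (s.iteration % interval == 0) {
            (void)save_checkpoint(path, &s);
        }
    }
    return s.sum;
}
```

Calling `run_until` a second time with the same path resumes from the last saved iteration instead of starting over, which is the rollback-recovery behavior in miniature. A production version would likely write to a temporary file and rename it so that a crash during checkpointing cannot corrupt the previous checkpoint; that step is omitted here for brevity.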
Keywords/Search Tags: Fault tolerance, Distributed systems