Research And Implementation On Transparent Fault-Tolerant Computation Based On Active Replication

Posted on: 2006-10-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X F Dai
Full Text: PDF
GTID: 1118360215959735
Subject: Computer application technology
Abstract/Summary:
Synchronization has become a bottleneck in the development of fault-tolerant computing systems because of its complexity. Moreover, for fault-tolerant systems whose fault-tolerance mechanism is implemented in software, or jointly in software and hardware, making that mechanism transparent is consistently the most difficult and most important problem. In addition, rollback-based fault recovery has an inherent weakness in the way it handles I/O operations, and this weakness needs to be overcome. Finally, semi-active replication and primary-backup redundancy cannot tolerate Byzantine faults, whereas active replication is effective against both crash failures and Byzantine faults. This dissertation therefore studies and implements a transparent fault-tolerant computing system based on TMR (triple modular redundancy) active replication on the Linux/PC platform.

First, following the definition and essence of fault-tolerant computation, a functional model of a fault-tolerant computing system is constructed using Petri nets. Based on this model, the synchronization problem is formulated theoretically, and the synchronization theorem is taken to be the policy under which one replica process is suspended to wait for the other replica processes. The different types of synchronization mechanism used in fault-tolerant systems are then analyzed in detail in terms of this theorem.

The synchronization requirements and synchronization policy of a TMR active replication fault-tolerant system are examined under the assumption that the latency of errors in a process's kernel data can be ignored, so that non-deterministic problems can be thoroughly eliminated. A synchronization algorithm for the TMR active replication system is then designed according to the synchronization theorem. The transparent synchronization mechanism is implemented with the Linux ptrace() system call, which is used to suspend a process or to modify the parameters and return values of its system calls. In addition, the fault-tolerance overhead of the system is estimated with a Markov reward model, leading to the conclusion that the overhead is directly proportional to the synchronization frequency.

A transparent TMR active replication fault-tolerant system is designed and implemented on the PC/Linux platform using the transparent fault-tolerance algorithm and a highly reliable two-level voting mechanism. The transparent fault-tolerance capability of the system is validated through fault-injection testing.

To improve performance, the fault-tolerance overhead of the system is measured. The results agree with the theoretical conclusion. Analysis of the cause shows that the overhead arises primarily from the communication latency of synchronization messages and from the original asynchrony distance between the active replica processes. The original asynchrony distance is first reduced with a timeout-based message ordering protocol taken from the literature. The communication latency of synchronization messages is then reduced by LLCE, a low-latency communication method over Ethernet that bypasses the TCP/IP protocol stack and programs the Ethernet interface controller directly. LLCE achieves lower latency and higher bandwidth than TCP/IP communication, which reduces the fault-tolerance overhead to a certain extent.
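As an illustration of the voting step that TMR active replication relies on, the following minimal Python sketch performs a plain majority vote over the outputs of three replicas at a synchronization point. It is not the dissertation's two-level voting mechanism, whose details the abstract does not give; the function name `tmr_vote` and the idea of comparing system-call outputs as byte strings are assumptions made only for this example.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def tmr_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Majority vote over the outputs of three replicas (hypothetical helper).

    Returns (agreed_value, dissenting_replica_indices).
    agreed_value is None when no two replicas agree.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")

    # Find the most frequent output and how many replicas produced it.
    value, count = Counter(outputs).most_common(1)[0]

    if count < 2:
        # All three replicas disagree: no majority, nothing can be masked.
        return None, [0, 1, 2]

    # Replicas whose output differs from the majority are treated as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


if __name__ == "__main__":
    # Replica 2 returns a corrupted buffer; the majority value masks it.
    print(tmr_vote([b"data", b"data", b"dat\x00"]))
```

A single faulty replica, whether it has crashed into producing garbage or behaves arbitrarily, is outvoted by the two correct ones, which is why the abstract describes active replication as effective against both crash failures and Byzantine faults.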
Keywords/Search Tags: fault-tolerant computation, active replication, synchronization, transparency, fault-tolerance overhead