Fault tolerance in adaptive real-time computing systems

Posted on:2003-06-13

Degree:Ph.D

Type:Dissertation

University:Stanford University

Candidate:Yu, Shu-Yi

GTID:1468390011980671

Subject:Engineering

Abstract/Summary:

Real-time computing systems often have stringent reliability and performance requirements. Failures in real-time systems can cause data corruption and degraded performance, with potentially catastrophic consequences. Previous researchers demonstrated the use of reconfigurable hardware (e.g., Field-Programmable Gate Arrays, or FPGAs) for implementing cost-effective fault tolerance techniques in general applications. In this dissertation, we develop techniques for designing reliable real-time systems on FPGAs.

To demonstrate the effectiveness of our design techniques, we implemented a robot control algorithm on FPGAs. Various fault tolerance features are implemented in the robot controller to ensure reliability. Our implementation results show that the performance of the FPGA-based controller with triple modular redundancy (TMR) is comparable to that of a software-implemented control algorithm (with TMR) on a microprocessor. We developed a roll-forward technique for transient error recovery in TMR-based real-time systems. Our technique does not need any re-computation and therefore significantly reduces the timing overhead associated with conventional recovery techniques. Analytical results show that our recovery scheme can significantly improve the reliability of TMR systems compared to conventional approaches. Implementation results for the robot controller design demonstrate that our recovery scheme introduces very small area overhead.

The conventional approach to permanent fault repair in FPGAs is to reconfigure the design so that the faulty part is avoided. However, for TMR systems with high area utilization or long mission times, this approach may not be applicable because no additional hardware resources are available. In such circumstances, our new permanent fault repair scheme reconfigures the original TMR-based design into another fault-tolerant design of smaller area so that the faulty elements are avoided. Unlike the TMR design, however, the smaller design may incur extra delays during transient error recovery. Three new design techniques for this repair scheme are presented. Analytical results show that these techniques can significantly reduce the delay overhead due to rollbacks. The effectiveness of our repair approach is demonstrated using the robot controller design. A repair scheme is also designed for FPGA interconnects. Unlike conventional schemes that use redundant buses, our scheme needs only a spare wire. It can repair failures caused by a single faulty wire connecting FPGAs.
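
The roll-forward idea described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the dissertation's FPGA implementation: the names `majority_vote` and `tmr_step`, the integer state, and the toy integrator used as the control step are all hypothetical. It assumes three replicas compute the same next state each cycle, a voter picks the majority value, and a replica that disagrees has its state overwritten with the voted value instead of recomputing the step.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def majority_vote(values: List[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return the majority value and the indices of replicas that disagree.

    Raises RuntimeError when no value is shared by at least two replicas,
    i.e. when more than one replica is in error.
    """
    value, count = Counter(values).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica disagrees")
    faulty = [i for i, v in enumerate(values) if v != value]
    return value, faulty


def tmr_step(
    states: List[Hashable],
    step: Callable[[Hashable, int], Hashable],
    u: int,
) -> Tuple[Hashable, List[Hashable], List[int]]:
    """Run one control cycle on three replicas with roll-forward recovery.

    Each replica computes its next state from its own (possibly corrupted)
    state. After voting, any disagreeing replica simply adopts the voted
    state, so no cycle is recomputed (hypothetical model of roll-forward).
    """
    next_states = [step(s, u) for s in states]
    voted, faulty = majority_vote(next_states)
    for i in faulty:
        next_states[i] = voted  # roll forward: copy the voted state
    return voted, next_states, faulty


if __name__ == "__main__":
    # Toy control step (assumption): a simple integrator.
    integrate = lambda s, u: s + u
    # Replica 1 starts with a state corrupted by a transient error.
    voted, states, faulty = tmr_step([0, 99, 0], integrate, 1)
    print(voted, states, faulty)
```

In this model the corrupted replica is back in agreement at the end of the same cycle in which the error was detected, which is the property that distinguishes roll-forward from rollback-style recovery, where a checkpoint is restored and work is repeated.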

Keywords/Search Tags:

Systems, Fault, Real-time, Scheme, Failures, Repair, FPGAs

Related items

1	Fault-Tolerant Scheduling Of Real-Time Tasks In Heterogeneous Systems
2	Research Of Fault-tolerant Control For The Complex Control Systems
3	Fault recovery in discrete-event systems with intermittent and permanent failures
4	Design Optimization Of Security-Critical Real-Time Applications With Fault-Tolerance Enhancement
5	Schedulability Analysis For Fault-Tolerant Real-Time Systems
6	The Research Of Real-time Fault-Tolerant Mechanism In Distributed Real-time System DRTAS
7	Research Of Some Scheduling Problems For Real-Time Tasks On Heterogeneous Clusters
8	Schedulability Analysis For The Fault-Tolerant Hard Real-Time Systems
9	Research On Energy-Efficient Scheduling Algorithm Of Distributed Real-Time Systems
10	Research On The Real-Time Fault-Tolerant Scheduling Algorithms For Distributed Systems