
Fault tolerance in adaptive real-time computing systems

Posted on: 2003-06-13 | Degree: Ph.D | Type: Dissertation
University: Stanford University | Candidate: Yu, Shu-Yi | Full Text: PDF
GTID: 1468390011980671 | Subject: Engineering
Abstract/Summary:
Real-time computing systems often have stringent reliability and performance requirements. Failures in real-time systems can result in data corruption and lower performance, with potentially catastrophic consequences. Previous researchers demonstrated the use of reconfigurable hardware (e.g., Field-Programmable Gate Arrays, or FPGAs) for implementing cost-effective fault tolerance techniques in general applications. In this dissertation, we developed techniques to design reliable real-time systems on FPGAs.

To demonstrate the effectiveness of our design techniques, we implemented a robot control algorithm on FPGAs. Various fault tolerance features are implemented in the robot controller to ensure reliability. Our implementation results show that the performance of the FPGA-based controller with triple modular redundancy (TMR) is comparable to that of a software-implemented control algorithm (with TMR) on a microprocessor. We developed a roll-forward technique for transient error recovery in TMR-based real-time systems. Our technique does not need any re-computation and therefore significantly reduces the timing overhead associated with conventional recovery techniques. Analytical results show that our recovery scheme can significantly improve the reliability of TMR systems compared to conventional approaches. Implementation results in the robot controller design demonstrate that our recovery scheme introduces very small area overhead.

The conventional approach to permanent fault repair in FPGAs is to reconfigure the design so that the faulty part is avoided. However, for TMR systems with high area utilization or long mission times, this approach may not be applicable because additional hardware resources are not available. In such circumstances, our new permanent fault repair scheme reconfigures the original TMR-based design into another fault-tolerant design of smaller area so that the faulty elements are avoided. Unlike in TMR systems, however, extra delays may then occur during transient error recovery. Three new design techniques for this repair scheme are presented. Analytical results show that these techniques can significantly reduce the delay overhead due to rollbacks. The effectiveness of our repair approach is demonstrated using the robot controller design. A repair scheme is also designed for FPGA interconnects. Unlike conventional schemes that use redundant buses, our scheme needs only a spare wire. It can repair systems after failures caused by a single faulty wire connecting FPGAs.
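
The general idea of roll-forward recovery in a TMR system can be illustrated with a minimal software sketch. This is not the dissertation's FPGA implementation: the `Replica` class, its accumulate-style update, and the function names are hypothetical, chosen only to show that a replica hit by a transient error can adopt the majority-voted state instead of re-computing.

```python
from collections import Counter


def majority_vote(values):
    """Return the value held by at least two of the replicas, or None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= 2 else None


class Replica:
    """Hypothetical controller replica; its state doubles as its output."""

    def __init__(self, state=0):
        self.state = state

    def step(self, sample):
        # Placeholder control update (an assumption): accumulate the input.
        self.state += sample
        return self.state


def tmr_step(replicas, sample):
    """Run one control cycle on all replicas, vote, then roll forward."""
    outputs = [r.step(sample) for r in replicas]
    voted = majority_vote(outputs)
    if voted is None:
        raise RuntimeError("no majority: more than one replica disagrees")
    for r in replicas:
        if r.state != voted:
            # Roll-forward: overwrite the corrupted state with the voted
            # one; the cycle is not re-executed.
            r.state = voted
    return voted


replicas = [Replica(), Replica(), Replica()]
tmr_step(replicas, 5)
replicas[1].state ^= 0b100  # inject a transient bit flip into one replica
result = tmr_step(replicas, 3)
```

In this sketch the faulty replica is brought back into agreement within the same cycle in which the mismatch is detected, so the voted output is still delivered on time.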
Keywords/Search Tags: Systems, Fault, Real-time, Scheme, Failures, Repair, FPGAs