Recovery in fault-tolerant distributed microcontrollers

Posted on: 2004-03-11
Degree: Ph.D
Type: Thesis
University: University of California, Los Angeles
Candidate: Hwang, Riki I-Ming
Full Text: PDF
GTID: 2468390011971041
Subject: Computer Science
Abstract/Summary:
A critical problem facing both government and commercial space programs is the need for lower cost, higher performance, and lower power consumption in on-board processing. Special radiation-hardened processors have been developed to operate in the space radiation environment, but they are typically one to two orders of magnitude behind commercial devices in performance, and they consume much more power. Yet most future space missions need much greater processing performance.

The use of commercial off-the-shelf (COTS) processors in space has been prevented by the fact that the space radiation environment causes an unacceptably high transient error rate, derailing their computations every few hours [MESS 92]. However, protective redundancy can be employed along with the technology of fault-tolerant computing to automatically recover from such errors and thus enable their use.

This thesis focuses on one aspect of this problem: embedded microcontrollers, highly integrated computer systems on a single chip that, not unlike those used in modern automobiles, control the various subsystems that make up a spacecraft. It examines tradeoffs and experiments with design techniques required to implement fault-tolerant distributed networks using embedded microcontroller processing nodes.

A new fault-tolerant node architecture was developed that allows differing amounts of redundancy to be employed with minimal design change. This includes a special isolated wire-or output system that allows modules to be powered down to recover from some potentially destructive radiation events (latchup). A novel recovery approach was developed that uses comparison voting for error detection and recovery but also employs a "stable" set of recovery actions to allow recovery if multiple errors or Byzantine behaviors occur. Finally, a redundant intercommunication architecture between embedded processing nodes was developed that provides fault tolerance in communications between them.

A testbed has been constructed, a real-time executive has been developed, and a supporting test environment has been implemented to allow fault-insertion testing of the experimental architecture. Our initial results strongly support the viability of the fault-tolerance approaches we have developed.
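
The comparison voting mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm, which the abstract does not specify; it assumes that redundant modules each produce one comparable output per cycle, that a strict majority decides the result, and that a single fixed fallback stands in for the "stable" recovery actions. The names `compare_and_vote`, `recover`, and `RESTART_FROM_KNOWN_STATE` are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def compare_and_vote(
    outputs: Sequence[Hashable],
) -> Tuple[Optional[Hashable], List[int]]:
    """Vote across redundant module outputs (hypothetical helper).

    Returns (majority_value, suspect_indices). When no value is held by a
    strict majority, majority_value is None and every module is suspect.
    """
    if not outputs:
        raise ValueError("at least one module output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority: voting alone cannot identify the faulty module.
        return None, list(range(len(outputs)))
    # Modules that disagree with the majority are flagged for recovery.
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


def recover(outputs: Sequence[Hashable],
            safe_action: str = "RESTART_FROM_KNOWN_STATE"):
    """Use the voted value, or an assumed fallback action if voting fails."""
    value, suspects = compare_and_vote(outputs)
    if value is None:
        return safe_action, suspects
    return value, suspects
```

With three modules reporting `[7, 7, 9]`, the sketch returns `7` and flags module index 2. With `[1, 2, 3]` there is no majority, so it returns the assumed fallback action and flags all three modules.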
Keywords/Search Tags: Recovery, Developed, Fault-tolerant, Space, Commercial