Improving utilization and availability of high-performance computing in space

Posted on:2007-10-09

Degree:Ph.D

Type:Dissertation

University:University of Florida

Candidate:Subramaniyan, Rajagopal


GTID:1448390005965468

Subject:Engineering

Abstract/Summary:

Space missions involving science and defense ventures have ever-increasing demands for data returns from their resources in space. The traditional approach of data gathering, data compression and data transmission is no longer viable due to the vast amounts of data. Over the past few decades, there have been several research efforts to make high-performance computing (HPC) systems available in space. The idea has been to have enough "on-board" processing power to support the many space and Earth exploration and experimentation satellites orbiting Earth and/or exploring the solar system. Such efforts have led to small-scale supercomputers embedded in the spacecraft and, more recently, to the idea of using commercial-off-the-shelf (COTS) components to provide HPC in space. The susceptibility of COTS components to Single-Event Upsets (SEUs) is a concern, especially since space systems need to be self-healing and robust to survive the hostile environment. Fault-tolerant system functions need to be developed to manage the available resources and improve the availability of the HPC system in space. However, the resources available to provide fault tolerance are fewer than those in traditional HPC systems on Earth.

Several techniques exist in traditional HPC to provide fault tolerance and improve overall computation rate, but adapting these techniques for HPC in space is a challenge due to the resource constraints. In this dissertation, this challenge is addressed by providing solutions to improve and complement HPC in space. Three techniques are introduced and investigated in three different phases of this dissertation to improve the effective utilization and availability of HPC in space. In the first phase, a new model to perform checkpointing at an optimal rate is developed to improve useful computation time. The results suggest that I/O capabilities far superior to those of present systems are required. In the second phase, the performance of several common HPC scheduling heuristics that can be used for effective task scheduling to improve overall execution time is analyzed through simulation. In the third phase, availability is improved by designing a new lightweight fault-tolerant message-passing middleware. Analyses of applications developed with the fault-tolerant middleware show that the robustness of systems in space can be significantly improved without degrading performance. In summary, this dissertation provides novel methodologies to improve utilization and availability in space-based high-performance computing, thereby providing better and more effective fault tolerance.
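
The abstract does not describe the checkpointing model developed in the first phase, so the sketch below is not that model. It only illustrates the general idea of an optimal checkpointing rate using Young's well-known first-order approximation, in which the interval between checkpoints is the square root of twice the checkpoint cost times the mean time between failures. The function names and the numeric values are hypothetical and chosen purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    NOTE: illustrative only; this is NOT the model from the dissertation.
    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed).
    mtbf_s: mean time between failures, in seconds (assumed).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of the fraction of time lost.

    Two terms: time spent writing checkpoints (cost / interval) and the
    expected rework after a failure (about half an interval per MTBF).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # hypothetical values, in seconds
    t_opt = young_interval(cost, mtbf)
    print(f"interval: {t_opt:.0f} s, "
          f"estimated waste: {waste_fraction(t_opt, cost, mtbf):.1%}")
```

With these assumed values the interval is 1000 s and the estimated overhead is about 10%. Under this simple approximation the overhead at the optimum shrinks as the checkpoint cost falls, which is one way to see why checkpoint I/O speed matters for useful computation time.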

Keywords/Search Tags:

Space, High-performance computing, Utilization and availability, HPC, Fault tolerance, Improve, Data

Related items

1	Optimization Techniques Of Proactive Fault Tolerance For Large-scale High Performance Computing Systems
2	Research And Implementation Of High Availability's Key Technology In High Performance Router Software
3	Research On Fault Tolerance Of High-performance Computing With NVRAM
4	Achieving Fault-Tolerance And High-Performance In Grid Applications
5	Research On Performance Optimization Techniques In Fault Tolerant Distributed Systems
6	Research On The Mechanism Of Process Migration For MPI Parallel Processes Oriented High Availability
7	Study And Implementation Of Fault Tolerance For Heterogeneous Parallel Computer
8	Research And Implementation Of Key Technologies On High-Availability Cluster System
9	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
10	Multi-Layer Fault Tolerance Techniques for High Reliability and Performance: Devices, Systems and Data Centers