Cluster Oriented Fault Tolerance For MPI Parallel Applications

Posted on: 2006-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: R N Xue
Full Text: PDF
GTID: 2178360182983602
Subject: Computer Science and Technology
Abstract/Summary:
As a cost-effective solution for fields that demand high-performance computing, the cluster system has played a leading role in parallel processing in recent years. The Message Passing Interface (MPI), which specifies a powerful message-passing mechanism, is widely used in parallel computing. MPICH is a portable implementation of MPI, and the MPICH implementation most widely used on clusters is based on the P4 parallel library. As one of the most important programming interfaces in the cluster environment, MPI should be equipped with fault tolerance capabilities to improve its reliability and availability. Checkpointing and Rollback Recovery (CRR) is an important fault tolerance technique for clusters. Based on the idea of time redundancy, it helps the system recover from failures by rolling back to a consistent global state saved in checkpoints.

This thesis analyzes the hierarchical design of MPI, the P4 implementation of MPICH, and various approaches to making the CRR mechanism correct and efficient. Many aspects of CRR are studied in detail. This research results in ChaRM4-MPI, a "Checkpoint-based Rollback Recovery and Process Migration System for Message Passing Interface". The system first centralizes the task management of MPI processes, then designs and implements three key fault tolerance mechanisms: 1) coordinated checkpointing; 2) synchronized rollback and recovery; 3) synchronized process migration. These mechanisms form the core functions of the system and realize its design objectives of user transparency, user-level implementation, and good performance.
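The idea of rolling back to a consistent global state can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the ChaRM4-MPI protocol: the names `Process`, `Coordinator`, and `run_step` are hypothetical, no MPI or P4 calls are used, and in-transit messages are modeled as plain Python lists. The coordinator drains all in-transit messages before saving state, so that no message crosses the checkpoint line.

```python
import copy


class Process:
    """Hypothetical stand-in for one MPI process (illustration only)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "total": 0}
        self.inbox = []  # messages sent to this process but not yet received

    def send(self, other, value):
        other.inbox.append(value)

    def receive_all(self):
        # Consume every pending message and fold it into local state.
        while self.inbox:
            self.state["total"] += self.inbox.pop(0)


class Coordinator:
    """Hypothetical central manager that checkpoints all processes together."""

    def __init__(self, processes):
        self.processes = processes
        self.checkpoint = None

    def take_checkpoint(self):
        # Phase 1: drain in-transit messages so the saved state is consistent.
        for p in self.processes:
            p.receive_all()
        # Phase 2: save every process state as one global snapshot.
        self.checkpoint = [copy.deepcopy(p.state) for p in self.processes]

    def rollback(self):
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint has been taken")
        # Restore all processes together and discard post-checkpoint messages.
        for p, saved in zip(self.processes, self.checkpoint):
            p.state = copy.deepcopy(saved)
            p.inbox.clear()


def run_step(processes):
    """One toy compute step: receive, advance, then send to the next rank."""
    for p in processes:
        p.receive_all()
    for p in processes:
        p.state["step"] += 1
        neighbor = processes[(p.rank + 1) % len(processes)]
        p.send(neighbor, p.rank + 1)


if __name__ == "__main__":
    procs = [Process(rank) for rank in range(3)]
    coordinator = Coordinator(procs)
    run_step(procs)
    run_step(procs)
    coordinator.take_checkpoint()
    run_step(procs)          # work that will be lost to a simulated failure
    coordinator.rollback()   # every process returns to the saved global state
    print([p.state for p in procs])
```

Restoring all processes from the same snapshot is what keeps the recovered state globally consistent; restoring only the failed process could leave its peers holding messages the restored process no longer remembers sending.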
Keywords/Search Tags: Cluster, Message Passing Interface, Checkpointing, Rollback Recovery, Process Migration