
Research On Multi-Replica Fault Tolerant Technology In MPI Environment

Posted on: 2016-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: F J Wu
GTID: 2308330503478055
Subject: Computer application technology
Abstract/Summary:
In recent years, the rapid development of science and technology has brought problems that require large-scale data processing and computation, such as DNA profiling in genetic engineering, accurate prediction of the global climate, and the calculation of ocean circulation. Such problems are difficult to solve with a general serial computing model, whereas parallel computing greatly improves computation speed and has become an effective way to solve them.

Today, parallel computing is realized mainly through parallel programming libraries. There are two programming models: the shared memory model and the message passing model. Owing to the rapid development of LANs, MPI, the parallel programming library based on the message passing model, has become the de facto standard for parallel programming.

As parallel computing systems grow in scale and run for longer, the probability of failure in an MPI computing system keeps rising. Because existing MPI computing environments have poor fault tolerance, and the failure of a single compute node causes the entire program to crash, fault tolerance has become a hot research topic for MPI environments. Current fault-tolerant techniques in the MPI environment include checkpoint/rollback and redundancy, but both have limitations. Checkpoint/rollback offers low reliability, and as the system scale increases, the share of time spent on useful work shrinks. Recent redundancy techniques do not consider execution efficiency in MPI environments built from heterogeneous PC nodes, which results in low efficiency.

Building on checkpoint and redundancy techniques, this thesis presents a multi-replica fault-tolerant scheme, R-MPI.
R-MPI uses a hierarchical fault detection structure and runs a fault detection protocol, the PUSH protocol, to detect failures of compute nodes. In its default redundancy configuration, R-MPI treats two nodes as one logical group that performs the same computing task, providing fault tolerance that is transparent to users: the system continues to operate normally when either node in a logical group fails. For communication between logical groups, R-MPI always uses the node with the higher computational performance to send data to other logical groups, which is more efficient than other redundancy techniques. R-MPI also provides flexible redundancy policy configuration and supports dynamic redundancy, which further improves system reliability.

Based on this redundancy scheme, we designed and implemented an R-MPI prototype system. We compared R-MPI with existing fault-tolerant schemes through prototype system tests and simulation tests. The results show that, compared with other fault-tolerant schemes, R-MPI ensures system reliability while sending fewer redundant messages and achieving higher efficiency.
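
The logical-group behaviour described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's implementation: the abstract does not specify how the PUSH protocol works, so it is modelled here as each node periodically pushing a heartbeat, with a node declared failed once no push has arrived within a timeout. The names `Replica`, `LogicalGroup`, `perf`, `record_push`, `detect_failures`, and `sender` are all hypothetical, and no real MPI calls are used.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    perf: float              # hypothetical relative performance score
    last_push: float = 0.0   # time of the most recent heartbeat push
    alive: bool = True


@dataclass
class LogicalGroup:
    rank: int                # logical rank shared by all replicas in the group
    replicas: list

    def record_push(self, name, now):
        # Assumed PUSH behaviour: a node reports that it is still running.
        for r in self.replicas:
            if r.name == name:
                r.last_push = now

    def detect_failures(self, now, timeout):
        # Mark replicas whose last push is older than the timeout as failed.
        failed = []
        for r in self.replicas:
            if r.alive and now - r.last_push > timeout:
                r.alive = False
                failed.append(r.name)
        return failed

    def sender(self):
        # The live replica with the highest performance sends for the group.
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("all replicas in the logical group have failed")
        return max(live, key=lambda r: r.perf)


# Default configuration from the abstract: two nodes per logical group.
group = LogicalGroup(rank=0, replicas=[Replica("node-a", perf=2.0),
                                       Replica("node-b", perf=1.0)])
group.record_push("node-a", now=1.0)
group.record_push("node-b", now=1.0)
first_sender = group.sender().name

# node-a stops pushing; node-b keeps reporting.
group.record_push("node-b", now=5.0)
failed = group.detect_failures(now=6.0, timeout=3.0)
second_sender = group.sender().name
```

In this sketch the faster node sends on behalf of the group while both replicas are alive, and the surviving replica takes over once the other is declared failed, which mirrors the transparent failover the abstract describes. A hierarchical detector could run `detect_failures` per group and report upward, but that layering is not modelled here.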
Keywords: MPI parallel computing, fault tolerance, redundancy, high efficiency