
Research On The Mechanism Of Process Migration For MPI Parallel Processes Oriented High Availability

Posted on: 2016-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Wang
Full Text: PDF
GTID: 2348330542975773
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of high performance computing (HPC), the number and scale of components in a system have grown to unprecedented levels. The system's mean time between failures (MTBF) has decreased accordingly, which seriously affects system reliability. It is therefore important to equip HPC systems with fault tolerance. Checkpoint/Rollback (C/R) is a widely used fault tolerance strategy in HPC, but because of its heavy I/O cost, C/R alone cannot meet the needs of HPC. Process migration, a proactive fault tolerance mechanism, has been proposed as a complement to C/R: processes running on a deteriorating node are transferred to a spare node and resumed there.

First, this thesis surveys the current state of fault tolerance mechanisms for HPC in China and abroad, and analyzes the C/R technology and process migration mechanisms that are widely used in HPC. Second, in order to checkpoint and roll back parallel programs on InfiniBand clusters, the thesis examines the traditional C/R framework and the InfiniBand communication channel architecture, and proposes a C/R framework based on the Fault Tolerance Backplane (FTB) as a complement to the traditional framework. On the basis of this new framework, the thesis designs and implements an FTB-based process migration mechanism that uses FTB as the communication infrastructure for exchanging fault-related messages during a migration. This mechanism can significantly improve the fault tolerance of open-source high-performance MPI implementations.

Moreover, the thesis analyzes the cost of each phase of a process migration and proposes a process migration protocol that targets the heavy cost of checkpoint data I/O and transfer. The protocol avoids the I/O cost of writing checkpoint data to the local file system and shortens the time needed to migrate processes away from a health-deteriorating node. As a result, the overall performance of proactive fault tolerance for HPC can be improved.
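
The abstract's point about skipping the local file system can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the thesis's protocol or its measured results: the function names, the three-phase breakdown (write, transfer, read), and all bandwidth figures are hypothetical, and the streamed case is idealized as being limited by the network alone.

```python
def file_based_migration_time(image_bytes, disk_write_bps, net_bps, disk_read_bps):
    """Toy model: write the checkpoint to local disk, copy the file to the
    spare node, then read it back there to restart (phases run one after another)."""
    write_time = image_bytes / disk_write_bps
    transfer_time = image_bytes / net_bps
    read_time = image_bytes / disk_read_bps
    return write_time + transfer_time + read_time


def streamed_migration_time(image_bytes, net_bps):
    """Toy model: checkpoint data is sent from memory straight to the spare
    node, so only the network transfer is counted (idealized assumption)."""
    return image_bytes / net_bps


# Hypothetical figures: 2 GB process image, 100 MB/s disk write,
# 1 GB/s interconnect, 200 MB/s disk read.
image = 2e9
file_based = file_based_migration_time(image, 100e6, 1e9, 200e6)
streamed = streamed_migration_time(image, 1e9)
print(f"file-based: {file_based:.1f} s, streamed: {streamed:.1f} s")
```

In this model the disk phases dominate whenever the interconnect is much faster than local storage, which is the situation the abstract describes for InfiniBand clusters.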
Keywords/Search Tags: High Performance Computing, MPI parallel program, Checkpoint/Rollback, Fault Tolerance, Process Migration