Research And Implementation Of PVM-based Cluster Fault-tolerance Method

Posted on: 2006-04-18 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Zhang | Full Text: PDF
GTID: 2168360155472631 | Subject: Computer application technology
Abstract/Summary:
With the development of microprocessor and network technology, cluster computing environments, exemplified by clusters of workstations, have become one of the hot spots in the study of parallel computing systems. However, as the number of nodes in a cluster increases, so does the probability that the failure of some node causes a fault in the whole system. Moreover, as tasks grow in scale and run for longer, the probability of a node fault during a run also increases. Therefore, without the necessary protection, the fault of any single node can make the whole system fail and waste a great deal of completed computation. A parallel system thus needs good fault-tolerance ability to ensure and enhance its reliability. Checkpointing saves the running state of a program so that it can be restored later, and it is an important means of implementing fault tolerance in parallel systems. Checkpointing methods are classified into synchronous and asynchronous methods. Synchronous checkpointing has been widely used in Network of Workstations systems: its algorithm is simple, its space overhead is relatively small, and it allows direct recovery. However, the system must synchronize before creating the global checkpoint; this synchronization suspends the running processes for a while, and the cost of the synchronization communication is high. PVM (Parallel Virtual Machine) is a popular parallel programming environment whose message-passing mechanism is highly efficient in heterogeneous network computing. Although PVM has some fault-tolerance ability of its own and can detect faults in the system, it provides no mechanism for recovering from them.
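As a rough illustration of why the odds of failure grow with cluster size, assume (this model is not stated in the abstract) that each of N nodes fails independently with probability p during one run of the job, and that any single node failure brings down the whole computation. The probability that an unprotected run fails is then:

```latex
P_{\text{fail}}(N) = 1 - (1 - p)^{N}
```

Under this assumption, p = 0.01 and N = 100 already give a failure probability of about 0.63, which is the kind of growth that motivates checkpointing.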
The traditional synchronous checkpointing method, based on a mechanism that drives messages out of the communication channels, is effective, but the number of auxiliary control messages it introduces grows as O(N^2) in the number of nodes N. As nodes are added, the number of control messages grows rapidly and the time overhead becomes large. To reduce the time overhead of synchronous checkpointing, a PVM-based quasi-synchronous checkpointing method is proposed in this paper. A checkpoint control process sends checkpointing signals to all processes. Each node stops running the application program after receiving the signal and begins its own checkpoint operation. After finishing the checkpoint operation, each process resumes the application program independently. In addition, a checkpoint counter is introduced. It identifies the subset of messages that are in the communication channels at checkpoint time, and a delayed-recording method is used to create a consistent global state. Synchronous checkpointing requires every process to synchronize at both the start and the end of the checkpoint, and it constructs a consistent global state by clearing the messages in the communication channels. By contrast, the quasi-synchronous method only sends a synchronization signal requesting the checkpoint operation at the start of checkpointing. Every node then saves its process state independently, and the consistent global state is constructed by asynchronously recording the messages in the communication channels. The quasi-synchronous method thus keeps the advantages of synchronous checkpointing while letting each node save its state independently by recording messages. As a result, the synchronization overhead of checkpointing is greatly reduced and the efficiency of the checkpoint operation is improved. The method is implemented in the PVM environment.
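The role of the checkpoint counter and delayed recording can be sketched with a small single-threaded simulation. This is only an illustrative reading of the abstract, not the thesis's PVM implementation: the class names, the piggybacked `epoch` field, and the rule that a process checkpoints before consuming a message from a later epoch are all assumptions, in the spirit of epoch-tagged snapshot schemes.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    src: int
    dst: int
    payload: str
    epoch: int  # sender's checkpoint counter at send time (assumed piggybacked)


@dataclass
class Process:
    pid: int
    state: int = 0  # stand-in for application state: messages consumed so far
    epoch: int = 0  # checkpoint counter
    saved_state: dict = field(default_factory=dict)  # epoch -> saved state
    channel_log: dict = field(default_factory=dict)  # epoch -> late messages

    def send(self, dst, payload):
        # Tag every outgoing message with the current checkpoint counter.
        return Message(self.pid, dst, payload, self.epoch)

    def take_checkpoint(self):
        # Save local state without waiting for any other process.
        self.epoch += 1
        self.saved_state[self.epoch] = self.state
        self.channel_log[self.epoch] = []

    def on_signal(self, target_epoch):
        # Signal from the (hypothetical) control process; ignored if this
        # process has already reached that checkpoint.
        if self.epoch < target_epoch:
            self.take_checkpoint()

    def receive(self, msg):
        if msg.epoch > self.epoch:
            # Sender has already checkpointed: checkpoint first so the saved
            # state does not include a message the sender's checkpoint lacks.
            self.take_checkpoint()
        elif msg.epoch < self.epoch:
            # Sent before the sender's checkpoint, received after ours: the
            # message was in the channel at checkpoint time, so record it late.
            self.channel_log[self.epoch].append(msg)
        self.state += 1  # consume the message

    def recover(self, epoch):
        # Restore the saved state, then replay the recorded channel messages.
        self.state = self.saved_state[epoch]
        for _ in self.channel_log[epoch]:
            self.state += 1


def demo():
    p0, p1 = Process(0), Process(1)
    m1 = p0.send(1, "a")   # sent in epoch 0
    p0.on_signal(1)        # control signal reaches p0
    p1.on_signal(1)        # ... and p1, before m1 is delivered
    p1.receive(m1)         # in-channel message: recorded with checkpoint 1
    m2 = p0.send(1, "b")   # sent in epoch 1
    p1.receive(m2)         # same epoch: ordinary delivery, not recorded
    return p0, p1
```

In this sketch no process waits for any other after the initial signal; consistency comes from comparing counters on receipt, which is the intuition behind replacing channel clearing with asynchronous message recording.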
The experimental results show that the proposed method achieves better performance. Finally, the fault-tolerant function of PVM is implemented by applying the quasi-synchronous checkpointing method on a system architecture with a redundant node.
Keywords/Search Tags: Parallel, Fault-tolerance, Checkpoint, Quasi-synchronization, Message, PVM