The Research Of Fault-Tolerant Techniques For Parallel/Distributed Network Simulator PDNS

Posted on:2009-12-16

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Zhao

Full Text:PDF

GTID:2178360278964499

Subject:Computer Science and Technology

Abstract/Summary:

Network simulation plays an important role in analyzing network behavior and evaluating protocols. PDNS is a widely used parallel/distributed network simulator, but like other typical distributed applications it suffers from limited system reliability. Checkpointing with rollback recovery is a useful fault-tolerance technique: the state of a program is saved to a checkpoint file while the program runs normally, and when an error brings the program down, the process is reconstructed from the state stored in that file. The program then continues from its most recent checkpoint, which saves considerable time compared with rerunning the simulation from the beginning.

This thesis studies how to improve the reliability of PDNS with checkpointing and rollback recovery. Distributed checkpointing algorithms build on single-process checkpointers, so for PDNS the basic problem is checkpointing one member of the simulation federation. After an analysis of checkpointers at different implementation levels, a user-level transparent checkpoint based on Condor is implemented on a single PDNS node. Its performance is then measured, and the effect of the number of nodes and links in the network topology on checkpoint overhead and space consumption is discussed.

The next problem in PDNS checkpointing is backing up and re-establishing the links between the federation members. PDNS nodes communicate over TCP in a LAN. The internal TCP implementation in Linux is examined first, and then a tool is designed as a kernel module that backs up and re-establishes the TCP connections between the simulating nodes.

With these two basic functions in place, the last problem in PDNS fault tolerance is choosing a suitable distributed checkpointing algorithm. PDNS uses conservative synchronization for distributed simulation and designates one node, labeled number 0 in libSynk, as the master process. Given these characteristics, the Sync-and-Stop coordinated checkpointing algorithm is chosen to build a prototype fault-tolerant model of PDNS. The thesis discusses the key issues and main techniques of PDNS fault tolerance, which can help improve the reliability of the distributed simulator.
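
The coordinated checkpointing idea in the abstract can be illustrated with a minimal in-memory sketch. It is not the thesis implementation: the class SimProcess, the functions sync_and_stop and rollback, and the toy clock/event state are all hypothetical names introduced here. No real PDNS, Condor, libSynk, or TCP code is involved, and checkpoints are kept as pickled bytes rather than files. The sketch assumes the usual three phases of a Sync-and-Stop style protocol: the coordinator stops every process, every process saves its state, and then all processes resume.

```python
import pickle


class SimProcess:
    """Hypothetical federation member; its whole state is a clock and a counter."""

    def __init__(self, rank):
        self.rank = rank
        self.clock = 0
        self.events_done = 0
        self.running = True

    def step(self):
        # Advance the toy simulation by one event, unless stopped.
        if self.running:
            self.clock += 10
            self.events_done += 1

    def stop(self):
        self.running = False
        return "stopped"

    def checkpoint(self):
        # Serialize the state; a real checkpointer would write a file.
        return pickle.dumps((self.clock, self.events_done))

    def restore(self, blob):
        self.clock, self.events_done = pickle.loads(blob)

    def resume(self):
        self.running = True


def sync_and_stop(processes):
    """Coordinator logic (assumed to run on rank 0): stop all, save all, resume all."""
    # Phase 1: every process stops, so no process advances during the save.
    acks = [p.stop() for p in processes]
    assert all(ack == "stopped" for ack in acks)
    # Phase 2: the saved states together form one consistent global checkpoint.
    snapshot = {p.rank: p.checkpoint() for p in processes}
    # Phase 3: let the simulation continue.
    for p in processes:
        p.resume()
    return snapshot


def rollback(processes, snapshot):
    """After a failure, restore every process from the same global checkpoint."""
    for p in processes:
        p.restore(snapshot[p.rank])


def demo():
    procs = [SimProcess(rank) for rank in range(3)]
    for _ in range(5):
        for p in procs:
            p.step()
    snapshot = sync_and_stop(procs)
    for _ in range(2):  # work done after the checkpoint is lost on failure
        for p in procs:
            p.step()
    before = [(p.clock, p.events_done) for p in procs]
    rollback(procs, snapshot)
    after = [(p.clock, p.events_done) for p in procs]
    return before, after
```

Calling demo() returns the per-process state just before the simulated failure and the state after rollback. Only the two steps taken after the checkpoint have to be redone, which is the time saving the abstract describes.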

Keywords/Search Tags:

distributed network simulation, fault-tolerance, checkpoint, socket re-establishment

Related items

1	Research On Key Technologies Of Fault Tolerance Of Large Scale Distributed Simulation System
2	Index-based Quasi-synchronous Checkpointing Protocols In Distributed Systems
3	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
4	The Design And Research Of Process Level Fault-tolerance Based On Checkpoint
5	Research On Fault Tolerance In Distributed Stream Data Processing
6	The Research And Implementation Of Checkpoint Technology Based On WinNT
7	Study Of Fault Tolerance Checkpoint Algorithm In Distributed System And Software Design
8	Research And Implementation Of The Automatic Jobs Fault Tolerant Technology Based On Checkpoint
9	Design And Implementation Of Distributed Stream Computing Framework Fault Tolerance
10	Optimization Strategies For Storage In Distributed Checkpoint System