Rescheduling Of Parallel Simulation Tasks In Grid

Posted on:2008-04-20

Degree:Master

Type:Thesis

Country:China

Candidate:C L Pan

Full Text:PDF

GTID:2178360245997913

Subject:Computer Science and Technology

Abstract/Summary:

As the Internet grows and its applications deepen, research on it is becoming increasingly important. Network simulation is an irreplaceable research method, but it faces a growing challenge as network models expand and performance evaluation demands greater accuracy. Researchers working on large-scale network simulation address this challenge in two ways. One is abstraction of the model, which reduces the consumption of computing and storage resources. The other is parallel simulation, which improves simulation speed and scale. This thesis deals mainly with the fault-tolerance problem that arises when implementing the second approach. The objective is to let large-scale parallel and distributed network simulation make better use of grid resources and provide a reliable, continuous simulation service.

This thesis adds application-level checkpoint and recovery functions to the Parallel Distributed Network Simulator (PDNS), enhancing its fault tolerance so that PDNS simulation tasks can run reliably and continuously in a distributed network or grid. That is, when a node processing a simulation task fails, the task can be migrated by means of its checkpoint file and resumed on a resource provided by the grid scheduling system, which requires a consistent rollback of the remaining nodes.

First, the thesis explains the motivation for implementing checkpointing and recovery at the application level. It then introduces "Application Level Checkpointing Based on Job Progress Description" as a theoretical model and technical reference. Building on an in-depth analysis of the principles, structure, and implementation of PDNS, it abstracts the simulator's run-time state data and defines the data structures that preserve this data at each checkpoint. This state data must be sufficient to recover the process correctly after a fault.
To keep the checkpoint information of the sub-nodes and the RTI consistent, the thesis defines the steps of the save and restore operations and gives a method for storing and recovering network communication.

Finally, building on the application-level checkpoint functions implemented for PDNS, the thesis describes how to schedule fault-tolerant parallel simulation tasks in a grid. Combining grid resource management with task scheduling, the system provides a way to run parallel simulations on a grid platform: it detects and locates faults through grid error monitoring, resumes processing on new resources through grid task scheduling while maintaining efficiency, and continues processing transparently without losing much of the completed computation. In addition, for completeness and drawing on the results of our research group, the thesis outlines a grid application management system that incorporates survivability-enhancing techniques for grid applications...
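
The checkpoint-and-rollback idea described in the abstract can be illustrated with a minimal Python sketch. It is not the PDNS implementation: the state fields (`sim_time`, `pending_events`, `in_flight`), the file naming scheme, and all function names are assumptions made for illustration only. Each node saves its state under a sequence number, and after a failure every node rolls back to the newest sequence number that all nodes hold, which is one simple way to obtain a consistent rollback.

```python
import json
import os
import tempfile


def make_state(node_id):
    """Hypothetical run-time state of one simulation node (illustrative fields)."""
    return {
        "node_id": node_id,
        "sim_time": 0.0,        # local simulation clock
        "pending_events": [],   # [timestamp, payload] pairs not yet processed
        "in_flight": [],        # messages sent but not yet delivered
    }


def checkpoint_path(directory, node_id, seq):
    return os.path.join(directory, f"node{node_id}_ckpt{seq}.json")


def save_checkpoint(state, directory, seq):
    """Write the state to a temporary file, then rename it into place,
    so a crash during the write never leaves a half-written checkpoint."""
    path = checkpoint_path(directory, state["node_id"], seq)
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)
    return path


def load_checkpoint(directory, node_id, seq):
    with open(checkpoint_path(directory, node_id, seq)) as f:
        return json.load(f)


def latest_common_checkpoint(available):
    """available maps node_id -> set of checkpoint sequence numbers.
    Returns the highest sequence number held by every node, or None."""
    if not available:
        return None
    common = set.intersection(*available.values())
    return max(common) if common else None


def recover_all(directory, available):
    """Roll every node back to the latest checkpoint that all nodes share."""
    seq = latest_common_checkpoint(available)
    if seq is None:
        raise RuntimeError("no checkpoint shared by all nodes")
    states = {nid: load_checkpoint(directory, nid, seq) for nid in available}
    return seq, states
```

In this sketch, a node that had progressed past the shared checkpoint discards its newer state on recovery. That is the "lost computation" a real system would try to keep small, for example by checkpointing all nodes at agreed points.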

Keywords/Search Tags:

Large-scale network simulation, checkpoint, grid, fault tolerance, job scheduling

Related items

1	Research On Key Technologies Of Fault Tolerance Of Large Scale Distributed Simulation System
2	A Checkpoint-Based Fault-Tolerant Service In Distributed Systems
3	Checkpoint-based Runtime Dynamic Fault Tolerant In Heterogeneous System
4	Optimization Strategies For Storage In Distributed Checkpoint System
5	Meet The Survivability Of The Research And Implementation Of Collaborative Checkpoint
6	Research And Implementation Of The Automatic Jobs Fault Tolerant Technology Based On Checkpoint
7	For Grid Checkpoint Technology
8	Study On Modeling Large-Scale Grid Platforms And Scheduling Algorithms
9	The Research Of Fault-Tolerant Techniques For Parallel/Distributed Network Simulator PDNS
10	Research On Fault Diagnosis Approach At Sub-network Level In Large-scale Tolerant Analog Circuits