An Asynchronous Migration Protocol Which Supports Concurrent Parallel Job Migration In Grid Environment

Posted on: 2010-10-31 Degree: Master Type: Thesis
Country: China Candidate: H L Li Full Text: PDF
GTID: 2178360272995892 Subject: Computer system architecture
Abstract/Summary:
With their rapid development, Grids can organize more and more geographically dispersed, idle resources around the world into a high-performance virtual computer, which provides an unprecedented computing environment for parallel applications. At the same time, the distributed parallel computing environment has been upgraded from the original clusters to Grids. On the one hand, the ever-increasing computing resources improve system throughput significantly. On the other hand, frequent failures of computing nodes bring new trouble. The effectiveness and robustness of the distributed parallel computing environment are under severe challenge. As a result, fault tolerance, load balancing, and job migration for parallel applications have become hot spots in high-performance computing research.

Most current research is limited to homogeneous clusters and focuses on parallel job migration among the nodes of the same cluster. As the distributed parallel computing environment moves from traditional clusters to Grids, load balancing and fault tolerance for parallel applications in Grids have become a critical issue. Job migration technology is the foundation of load balancing and fault tolerance. Currently, parallel job migration is based on globally consistent checkpoints, which cannot meet the concurrency demands of parallel applications. Additionally, most fault-tolerance technologies in Grid environments use synchronous job migration, even though the autonomy of Grid resources makes node failure unpredictable and uncontrollable.

Based on the MPICH-G2 system, this paper presents a parallel job migration mechanism for Grid environments. First, we implement a signal mechanism from the client program to the remote sub jobs by extending the signal functions in the Globus GRAM protocol. Second, we build on a new checkpoint library, under development by another student in our laboratory, that supports parallel job migration in Grid environments.
We use this checkpoint library to store the execution state of a sub job and to resume that sub job's execution. Third, by studying the communication mechanism and topology of parallel applications in the MPICH-G2 environment, we successfully reconstruct the communication relations between sub jobs after job migration. This procedure also preserves the validity of the original communication mechanism and the consistency of the topology among sub jobs. The parallel application can resume execution without any interference after the job migration, which preserves the transparency of the entire parallel computing environment. Finally, based on popular Grid protocols such as DUROC and GRAM, we implement a dynamic parallel job management mechanism and successfully maintain the life cycle of a parallel job. This work extends MPICH-G2 to support dynamic parallel job management.

A novel Asynchronous Migration Protocol (AMP) is proposed in this paper to enable concurrent job migration in Grid environments. First, the protocol preserves the concurrency of live sub jobs by avoiding globally consistent checkpoints. Second, AMP introduces a novel hierarchical sub job structure to organize sub jobs in Grids. This structure minimizes cross-domain protocol messages and reduces system latency. Finally, using a timestamp mechanism and the hierarchical sub job structure, AMP dynamically propagates the new physical ID of a migrating sub job to the entire community, which ensures the consistency of communication relations among sub jobs. To sum up, AMP successfully supports concurrent job migration in Grids.

Through experiments and analysis, we come to the following conclusions: (1) AMP efficiently minimizes cross-domain communication when a migrating sub job announces its new address. In this way, AMP reduces protocol latency in Grid environments.
The communication complexity of AMP is O(N), which depends only on the number of clusters used. When a parallel job is not distributed across many clusters, AMP remarkably reduces migration latency. (2) Our work barely interferes with MPICH-G2's original direct message communication mechanism, even though we have modified some of the source code. The experimental results show that parallel applications incur little additional latency with our migration library.

The work presented in this paper innovatively solves the problem of concurrent parallel job migration in Grid environments: (1) To accommodate the autonomy of Grid resources, the migration protocol presented in this paper avoids globally consistent operations. Instead of a global checkpoint, we checkpoint only the migrating sub jobs, so the live sub jobs can continue to execute. This design reduces global synchronization and system latency, and thereby preserves the concurrency of a parallel application. (2) The migration protocol in this paper (AMP) accommodates the special resource topology of Grids. In a Grid environment, resources are located in geographically dispersed clusters, and cross-domain communication is much more expensive than communication within a LAN. We use a hierarchical sub job structure to reduce cross-domain communication. As a result, job migration is accomplished mostly through in-LAN communication, with few cross-LAN messages. (3) Resource availability changes dynamically in Grid environments. Previous research uses global synchronization to ensure system consistency: while one migration is in progress, no other migrations are permitted. A synchronous migration protocol impairs the concurrency of a parallel application and is at odds with the dynamically changing resource availability in Grids. Another goal of our work is therefore to present an asynchronous migration protocol that solves this problem.
AMP takes advantage of a novel timestamp mechanism and the hierarchical sub job structure to support concurrent parallel job migration in Grids.

In conclusion, the migration protocol presented in this work avoids global synchronization; by reducing cross-domain communication, it minimizes migration latency; and together with the timestamp mechanism and the hierarchical sub job structure, AMP supports concurrent parallel job migration. The work presented in this paper has successfully solved the problem of concurrent parallel job migration. Because job migration is the foundation of fault tolerance and load balancing, this work is of great significance to high-performance computing research.
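
The abstract does not give AMP's message formats or data structures, so the following is only a minimal sketch of the two ideas it names: a timestamp that lets each sub job discard stale address updates, and a hierarchy in which one cross-domain message per remote cluster is fanned out locally. All names here (`SubJob`, `Community`, `announce`, `learn`) are hypothetical, and the choice of the first member of each cluster as the relay is an assumption made for illustration, not the thesis's design.

```python
from dataclasses import dataclass, field


@dataclass
class SubJob:
    """Hypothetical sub job holding its own view of peers' physical addresses."""
    logical_id: int
    cluster: str
    address: str
    # logical_id -> (timestamp, address) as last learned by this sub job
    peers: dict = field(default_factory=dict)

    def learn(self, logical_id, timestamp, address):
        """Accept an address only if its timestamp is newer than the known one."""
        known = self.peers.get(logical_id)
        if known is None or timestamp > known[0]:
            self.peers[logical_id] = (timestamp, address)
            return True
        return False


class Community:
    """All sub jobs of one parallel job, grouped by cluster (assumed hierarchy)."""

    def __init__(self, jobs):
        self.jobs = {j.logical_id: j for j in jobs}
        self.cross_domain = 0  # messages that cross cluster boundaries
        self.in_lan = 0        # messages that stay inside one cluster

    def members(self, cluster):
        return [j for j in self.jobs.values() if j.cluster == cluster]

    def announce(self, migrant_id, timestamp, new_cluster, new_address):
        """Publish a migrated sub job's new address to every other sub job."""
        migrant = self.jobs[migrant_id]
        migrant.cluster, migrant.address = new_cluster, new_address
        for cluster in sorted({j.cluster for j in self.jobs.values()}):
            receivers = [j for j in self.members(cluster) if j is not migrant]
            if not receivers:
                continue
            if cluster != new_cluster:
                # One cross-domain message reaches the relay (receivers[0]),
                # which forwards to the remaining members inside the LAN.
                self.cross_domain += 1
                self.in_lan += len(receivers) - 1
            else:
                # The migrant's own cluster is reached with in-LAN messages only.
                self.in_lan += len(receivers)
            for job in receivers:
                job.learn(migrant_id, timestamp, new_address)


# Three clusters (A, B, C) with three sub jobs each; sub job 4 moves from B to C.
jobs = [SubJob(i, "ABC"[i // 3], f"node-{i}") for i in range(9)]
community = Community(jobs)
community.announce(migrant_id=4, timestamp=1, new_cluster="C", new_address="node-c9")
print(community.cross_domain, community.in_lan)
```

In this toy run, the announcement costs two cross-domain messages (one per remote cluster) and six in-LAN messages, which is consistent with a cross-domain cost that grows with the number of clusters rather than the number of sub jobs. Because `learn` compares timestamps, an update that arrives late with an older timestamp is ignored, so concurrent migrations do not need a global lock in this sketch.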
Keywords/Search Tags: Grid, MPICH-G2, Parallel Job Migration, Asynchronous Migration Protocol