Achieving Fault-Tolerance And High-Performance In Grid Applications

Posted on:2005-08-06

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y F Chen

Full Text:PDF

GTID:1118360215498497

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

The rapid expansion of the Internet and the dramatic increase in available network bandwidth has qualitatively changed how the world computes, communicates, and collaborates. The emergence of computational grids provides a promising infrastructure that can potentially make high-performance computing power accessible to general users as easily and seamlessly as electricity from an electrical power grid. Due to the focus on large-scale resource sharing and innovative problem solving, grids enable applications that are conceptually more complex than current parallel and distributed applica-tions, including the applications that could not be solved before. However, it is not a trivial task to develop a generic solution that can offer fault tolerance and high performance for the wide spectrum of grid applications due to the following reasons:The scale, the heterogeneity, and the dynamic nature of a grid make it difficult to efficiently support grid-scale interprocess communication, coordination, and synchronization, thus signifi-cantly complicating the design of efficient, reliable, and scalable grid applications;Grid applications basically rely on the standard grid protocols for executing their computations on grid resources, but the hourglass design principle makes them too generic to support any fault tolerance mechanisms;Different grid applications have distinct performance goals inherently determined by the appli-cation's characteristics so that they are hard to be addressed via a generic approach.Therefore, this dissertation aims to address the fault-tolerance and/or high-performance issues for the following three typical grid applications that motivate the development of grid technology in both scientific and commercial communities. In a broad sense, these issues share the same goals in effi-ciently performing coordination and synchronization, effectively utilizing network resources, and re-markably overcoming network latency.Computation-intensive applications. With increases in system scale, fault-tolerance has be-come such a critical issue that without its support, grid applications have little chance to finish successfully, especially for those that require days or even months to execute. Frequent sys-tem-wide synchronization and coordination should be avoided since it will result in high over-head and bad scalability in a loosely coupled computing environment. To simplify the con-struction of the applications, fault-tolerance function needs to be separated from application logic functionality and be abstracted as a generic service opened to application programmers. Also, for these computation-intensive applications, a scalable checkpointing and rollback re-covery scheme may be preferable to task replication with voting and consensus for its possi- bility in saving computational resources.Grid information retrieval. In emerging grids, vast volumes of geographically distributed data resources will be pooled together and are available for being integrated into applications. These data resources usually cross over different administrative domains and are structured in distinct ways. Given that data resources are loosely coupled and heterogeneous, the difficulty in effectively retrieving useful knowledge from a grid is more highlighted than doing that from a traditional distributed information system. Therefore, grid information retrieval sys-tems are expected to be enabled to reduce bandwidth utilization, overcome latency, tolerate network disconnection, and utilize the parallelized computational resources in grids.Content delivery services. Information providers who seek for a cost-effective solution today are increasingly choosing Web content delivery services to store, maintain, and provide access to their documents. To address the scalability, availability, and performance problem of their popular Internet sites, information providers commonly use delivery services to replicate or mirror their content over Internet, as exemplified by the prevalence of Content Delivery Net-works (CDNs). Surely, this kind of services is currently a driving force that propels the development of grid technology, and will eventually use a grid-like structure to alleviate server workload, eliminate redundant data traversal over the network, reduce response latency, increase system throughput, and most importantly, deliver desired QoS. However, the CDN performance is dramatically affected by its surrogate placement policy. In a grid environment, the optimal surrogate placement problem for content delivery services becomes a more serious issue as the vast available computing resources drastically enlarge the decision space.In a word, the goals of this dissertation are to develop techniques to achieve fault tolerance for computation-intensive grid applications, efficiency for grid information retrieval, and high-perform-ance for content delivery services in grids. The major contributions are summarized as follows:A novel, mobile agent enabled fault tolerance technique, MACR, that can be easily integrated into grid applications through the separation of concerns, is proposed. The importance of MACR is justified by our finding that no bridging mechanism exists that can integrate or reuse multiple underlying fault tolerance protocols deployed in different administrative domains without modifying them. MACR takes advantage of both the independent and the coordinated checkpointing approaches by exploring the appealing features of mobile agent technology and using mobile agents to carry out coordination tasks for cooperating local processes; as a result, it can not only be made adaptable to the system scale and the network state changes, but also can guarantee low runtime overhead, while at the same time eliminating the possibility of the undesirable domino effect. The feasbility of MACR is demostrated by designing a hybrid checkpointing and rollback algorithm using mobile agents as an aid. The correctness of the al-gorithm is formally proved and the performance and scalability issues are systematically dis-cussed. Techniques are proposed to address the scalability issue and the potential performance bottleneck of the proposed algorithm, and simulations are performed to show the effectiveness and scalability of the proposed protocol and its improving techniques.A flexible layered architecture of mobile agent system that aims to achieve high-performance for grid information retrieval is presented. Based on this architecture, a new genetic algorithm is proposed to schedule the optimal migration of mobile agents with the objective to minimize the total communication cost subject to a user-defined limit of task completion time. The pro-posed genetic algorithm encodes the agent migration trees by the pre-ordered traversal se-quence of tree vertices, together with the children number sequence of corresponding tree ver-tices. Through theoretical analysis and experimental simulations, this encoding is proved to be able to guarantee a one-to-one correspondence between the encoding space and the problem space, and has the advantages of simplicity for encoding and decoding, ease for GA operations, better equilibrium between exploration and exploitation, and fast convergence to the optimal or near-optimal solution.A new surrogate placement strategy, LDASP, is proposed to enhance performance for content delivery services. LDASP aims to address surrogate placement in a manner that minimizes the communication cost while ensuring at the same time the maximization of system throughput. This work differs from existing work on the resource allocation problem in communication networks in that it considers load distribution and processing capacity constraints on surro-gates by modeling the underlying request-routing mechanism, thus guaranteeing a CDN to have minimum network resource consumption, maximum system throughput, and better load balancing among surrogates. An efficient and effective greedy algorithm and an approximate dynamic programming algorithm are developed for a simplified version of the LDASP prob-lem in tree networks. The correctness, efficiency, and effectiveness of the proposed algorithms are systematically analyzed through formal proofs and experimental simulations.

Keywords/Search Tags:

grid computing, fault tolerance, high performance, mobile agent, checkpointing and roll-back recovery, information retrieval, content delivery service

PDF Full Text Request

Related items

1	Research On Fault-Tolerance Technology For Message-Passing System
2	Optimization Techniques Of Proactive Fault Tolerance For Large-scale High Performance Computing Systems
3	Grid Computing Oriented Dynamic Fault Tolerance Service Strategies And Corresponding Algorithms
4	Research On Checkpointing And Rollback Recovery Fault-tolerant Techniques For Mobile Computing Environment
5	Research On Low Overhead Non-blocking Checkpointing Scheme For Mobile Computing System
6	Research On Non-Blocking Coordinated Checkpointing Algorithm In The Mobile Computing Environment
7	Study And Implementation Of Fault Tolerance For Heterogeneous Parallel Computer
8	Study And Implementation Of Application-Level Checkpointing
9	Research On Key Technology Of Content Delivery Based On Service Grid Architecture
10	Incorporating File I/O Into Checkpointing Under Clusters Environment