Failure-Aware Reconfigurable Distributed Virtual Machine for dependable and high productivity computing

Posted on: 2009-07-21
Degree: Ph.D.
Type: Dissertation
University: Wayne State University
Candidate: Fu, Song
Full Text: PDF
GTID: 1448390005461236
Subject: Computer Science
Abstract/Summary:
Modern networked computing systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures become the norm rather than the exception, so it is important to ensure the availability and adaptivity of computing services. To this end, we present the Failure-Aware Reconfigurable Distributed Virtual Machine (FAR-DVM) framework for building failure-resilient and dependable high-productivity computing systems.

The framework monitors and analyzes node-, cluster- and system-wide failure behaviors and forecasts prospective failure occurrences based on quantified failure dynamics. The prediction results are used to manage system resources in a failure-aware manner. The system management components autonomically construct resilient and dependable services and integrate geographically distributed resources into a seamless environment.

Within the FAR-DVM framework, we propose hPREFECTS for proactive failure management. It collects failure events from compute nodes at runtime and constructs a failure signature for each event. It then analyzes the temporal and spatial correlations among failure signatures in different system scopes. A failure predictor uses the quantified correlation data to forecast the occurrence time of failures in the near future.

To manage system resources in a failure-aware manner, we also propose a construction and reconfiguration strategy for distributed virtual machines (DVMs) that leverages the failure prediction results. We consider both the performance and the reliability status of compute nodes, and define a capacity-reliability metric that combines the effects of both factors in node selection. We propose Best-fit algorithms with optimistic and pessimistic selection strategies to find the best-qualified nodes on which to construct and reconfigure DVMs.

We have designed and implemented a prototype of FAR-DVM and evaluated it in production environments. hPREFECTS achieves more than 76% accuracy in offline failure prediction on the Los Alamos HPC traces. For online prediction, its accuracy is more than 70% in the Wayne State Computational Grid. Our failure-aware resource management strategy improves system productivity with a practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is 17.6% higher than that achieved in the current LANL HPC cluster. The task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.

Complementing the work on failure-aware resource management, we have also proposed a service migration mechanism that moves runtime computing services from one compute node to another in the face of system anomalies. To evaluate the quality of migration policies, we have investigated the migration decision problem for load balancing. We derive the optimal time for service migration with the objective of minimizing migration frequency, and obtain the lower bound of the destination server capacity.
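
As an illustration of the failure-aware node selection described above, the following is a minimal sketch and not the dissertation's algorithm. The abstract does not give the form of the capacity-reliability metric, so the score used here (capacity discounted by a predicted failure probability) is an assumption, as are the names `Node`, `effective_capacity` and `best_fit`. The optimistic and pessimistic selection strategies are not modeled, because the abstract does not define them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    capacity: float  # available capacity, in hypothetical units
    p_fail: float    # predicted failure probability over the job's time window


def effective_capacity(node: Node) -> float:
    """Hypothetical capacity-reliability score (an assumption, not the
    dissertation's metric): capacity discounted by predicted failure risk."""
    return node.capacity * (1.0 - node.p_fail)


def best_fit(nodes: List[Node], demand: float) -> Optional[Node]:
    """Return the node whose score meets the demand with the least surplus,
    or None if no node qualifies."""
    candidates = [n for n in nodes if effective_capacity(n) >= demand]
    if not candidates:
        return None
    return min(candidates, key=lambda n: effective_capacity(n) - demand)


nodes = [
    Node("n1", capacity=10.0, p_fail=0.50),  # score 5.0
    Node("n2", capacity=8.0, p_fail=0.25),   # score 6.0
    Node("n3", capacity=16.0, p_fail=0.25),  # score 12.0
]
chosen = best_fit(nodes, demand=5.5)
print(chosen.name if chosen else "no qualified node")
```

With a demand of 5.5, `n1` is excluded even though its raw capacity would suffice, because its predicted failure risk lowers its score below the demand. Of the remaining nodes, `n2` is the tightest fit and is selected.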
Keywords/Search Tags:Failure, Computing, Distributed virtual, System, Migration, Dependable