Research On Failure Recovery Technologies Of High-performance Computing Services

Posted on:2013-07-03

Degree:Master

Type:Thesis

Country:China

Candidate:L Y An

GTID:2248330395480592

Subject:Signal and Information Processing

Abstract/Summary:

With the development of grid computing and cloud computing, service-oriented computing has become the major application paradigm in high-performance computing. Failure recovery techniques, a key means of improving service reliability, have attracted many researchers.

High-performance computing services are characterized by high throughput, heavy resource dependence, and long running times, and they involve jobs when handling user requests. As a result, existing failure recovery frameworks for Web Services cannot effectively resolve failure recovery in high-performance computing services. The thesis therefore first analyzes the requirements for failure recovery in high-performance computing services, and then proposes a failure recovery framework named H-FRF, based on the service mode and the reflection mode.

Service context is the basis of service recovery. Given the special execution mode of high-performance services, the thesis extends the typical Web Service context with a Task Context that describes the running state of a job, and then proposes a recovery-oriented method for classifying and representing context. Because context in high-performance computing services is varied in type and large in scale, we design a 2-Step persistence mechanism, a hierarchical management method, and a context pre-migration algorithm for use during service execution. Together these reduce the interference of context persistence with normal service execution, improve the survivability of the context, and speed up access to the context during service recovery.

For the recovery process itself, a service recovery mechanism based on the 5-R strategy is established to achieve fault recovery at every level, and an adaptive context recovery algorithm is designed to restore the service state quickly. We also provide a mechanism for degrading and upgrading jobs, which lets a job run efficiently on more nodes or continue running on fewer nodes when the number of computing nodes changes. In addition, a delayed-doubt failure detector is proposed to support accurate failure notification.

In the prototype system test, the average recovery success rate reaches 0.885, while the average recovery time is reduced by 30.4%. The framework established by the thesis therefore achieves a higher recovery success rate and a shorter recovery time, and is of practical value for failure recovery of high-performance computing services.
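The abstract names a "delayed doubt" failure detector without describing how it works. The sketch below is only one possible reading, not the thesis's algorithm: a heartbeat-based detector that first marks a silent node as suspected and declares it failed only after an additional delay, so that a late heartbeat can clear the suspicion. The class name, the two timing parameters, and the three states are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    FAILED = "failed"


@dataclass
class DelayedSuspicionDetector:
    """Hypothetical detector; names and thresholds are illustrative only."""

    heartbeat_timeout: float  # seconds of silence before a node is suspected
    doubt_delay: float        # extra seconds of silence before it is declared failed
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this node.
        self.last_seen[node] = now

    def state(self, node: str, now: float) -> NodeState:
        # Classify the node by how long it has been silent.
        silence = now - self.last_seen[node]
        if silence <= self.heartbeat_timeout:
            return NodeState.ALIVE
        if silence <= self.heartbeat_timeout + self.doubt_delay:
            return NodeState.SUSPECTED
        return NodeState.FAILED
```

Passing the current time in explicitly keeps the sketch deterministic. A caller would notify the recovery layer only when a node reaches the failed state, which trades a slower notification for fewer false alarms caused by transient delays.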

Keywords/Search Tags:

High-performance Computing Service, Failure Recovery, Context, Persistence, Failure Detection

Related items

1	Research On Failure Detection And Recovery Technology Based On SDN Network
2	Improving Availability With Fine-grained Failure Detection And Recovery
3	Research On Control Layer Failure Detection And Recovery Algorithm In SDN Framework
4	Research And Implementation Of Failure Detection And Recovery Techniques In SDN Network
5	Research Of The Failure Detection And Recovery Based On SDN
6	Design And Implementation Of The Failure Recovery Mechanism In MapReduce
7	Research And Implementation On Disaster-recovery Oriented Failure Detection Algorithm
8	Research On Qos-oriented Failure Detection Service In Distributed Systems
9	Large-scale High-performance Computer Cluster Failure Rapid Diagnosis And Automatic Recovery System Developed
10	Research And Implementation Of SDN Failure Monitoring And Recovery Technology