
Research On Failure Recovery Technologies Of High-performance Computing Services

Posted on: 2013-07-03 | Degree: Master | Type: Thesis
Country: China | Candidate: L Y An | Full Text: PDF
GTID: 2248330395480592 | Subject: Signal and Information Processing
Abstract/Summary:
With the development of grid computing and cloud computing, service-oriented computing has become the major application paradigm in high-performance computing. Because failure recovery is the key method for improving service reliability, many researchers have devoted themselves to failure recovery techniques. High-performance computing services are characterized by high throughput, heavy resource dependence, and long running times, and they need jobs to participate in handling user requests. As a result, existing failure recovery frameworks for Web Services cannot effectively resolve failure recovery in high-performance computing services. The thesis therefore first analyzes the failure recovery requirements of high-performance computing services, and then proposes a failure recovery framework named H-FRF, based on the service mode and the reflection mode. Service context is the basis of service recovery. According to the special execution mode of high-performance services, the thesis extends the typical Web Service context with a Task Context that describes the running state of a job, and proposes a recovery-oriented context classification and representation method. For service execution, given the varied types and large data volumes of context in high-performance computing services, we design a 2-Step persistence mechanism, a hierarchical management method, and a context pre-migration algorithm. Together these reduce the interference of context persistence with normal service execution, improve the survivability of the context, and speed up context access during service recovery. For the recovery process, a service recovery mechanism based on the 5-R strategy is established to achieve fault recovery at every level, and an adaptive context recovery algorithm is designed to restore the service state quickly.
Meanwhile, we provide a mechanism for degrading and upgrading jobs, which lets a job run efficiently on more nodes, or continue running on fewer nodes, when the number of computing nodes changes. In addition, a delayed-doubt failure detector is proposed to support accurate failure notification. In the prototype system test, the average recovery success rate reaches 0.885, while the average recovery time is reduced by 30.4%. The framework established by the thesis therefore achieves a higher recovery success rate and a lower recovery time, and is of practical value for failure recovery of high-performance computing services.
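The abstract does not describe how the delayed-doubt failure detector works internally. As a rough illustration only, the sketch below assumes it resembles a heartbeat detector that first marks a silent node as "suspected" and reports it as "failed" only after an additional grace period, which is one common way to reduce false failure notices. The class name, parameter names, and status strings are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DelayedSuspicionDetector:
    """Hypothetical heartbeat detector with a grace period before declaring failure.

    This is an assumed reading of "delayed doubt", not the thesis's algorithm.
    """
    heartbeat_timeout: float  # seconds of silence before a node is suspected
    suspicion_delay: float    # extra seconds a suspected node gets before it is reported failed
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def status(self, node: str, now: float) -> str:
        """Classify `node` as 'unknown', 'alive', 'suspected', or 'failed'."""
        if node not in self.last_seen:
            return "unknown"
        silence = now - self.last_seen[node]
        if silence <= self.heartbeat_timeout:
            return "alive"
        if silence <= self.heartbeat_timeout + self.suspicion_delay:
            # Doubt is raised, but the failure notice is withheld for now.
            return "suspected"
        return "failed"


detector = DelayedSuspicionDetector(heartbeat_timeout=3.0, suspicion_delay=2.0)
detector.heartbeat("node-1", now=0.0)
print(detector.status("node-1", now=2.0))   # within the timeout
print(detector.status("node-1", now=4.0))   # silent, but still inside the grace period
detector.heartbeat("node-1", now=4.5)       # a late heartbeat clears the suspicion
print(detector.status("node-1", now=5.0))
print(detector.status("node-1", now=10.0))  # silent past timeout plus grace period
```

In this sketch a late heartbeat during the grace period returns the node to "alive" without any failure notice being issued, which is the trade-off such a delay is meant to buy: fewer false alarms at the cost of slower detection.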
Keywords/Search Tags: High-performance Computing Service, Failure Recovery, Context, Persistence, Failure Detection