Development of a Rapid Fault Diagnosis and Automatic Recovery System for Large-Scale High-Performance Computer Clusters

Posted on: 2013-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
GTID: 2218330368997912
Subject: Software engineering
Abstract/Summary:
Large high-performance computing (HPC) systems provide the foundation that supports the development of computational fluid dynamics (CFD). HPC clusters have become the mainstream computing platform because of their high performance-to-price ratio. As CFD workloads grow, equipment is deployed faster and clusters become larger, and failures of every kind increase accordingly. Because staff are few, the work is complex, and response is slow, failures cannot be discovered and eliminated in time, and the system is slow to return to normal operation. This has become one of the decisive bottlenecks that prevent computing tasks from being completed on schedule and as required. This thesis describes the development of a rapid fault diagnosis and automatic recovery system for the large HPC cluster of a particular unit. It analyzes how the discovery and elimination of cluster faults at that unit currently depend on human factors and, in light of the urgent needs of its computing tasks, proposes a solution for rapid fault diagnosis and automatic recovery. The system was developed using rapid prototyping and modular design and is implemented mainly on the Linux system framework. The development aims to keep the system simple while paying attention to generality and to the extensibility of its modules.

The thesis first introduces the basic concepts and characteristics of CFD and its relationship to large HPC clusters, and explains the practical significance for the unit of developing a rapid fault diagnosis and automatic recovery system for such clusters. It then focuses on the design and implementation of the system. The design part starts from the basic process of cluster fault handling, the implementation goals of the system, and the system analysis, and proposes the overall design scheme. It then elaborates on the functional design and implementation of the subsystems for automatic monitoring of basic cluster status, monitoring of typical application problems, automatic recovery from typical faults, event logging, and event alarms. Finally, the test method for the system is introduced.
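
To make the subsystem breakdown in the abstract concrete, the following is a minimal Python sketch of how monitoring, automatic recovery, event logging, and alarms might fit together. It is not the thesis's implementation: the names `Rule` and `FaultManager`, the node names, and the simulated `daemon_up` and `disk_ok` state are all hypothetical, and the health checks are plain dictionaries standing in for real probes on a Linux cluster.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rule:
    """One fault type: how to detect it and, optionally, how to repair it."""
    name: str
    is_healthy: Callable[[str], bool]                 # node name -> True if OK
    recover: Optional[Callable[[str], None]] = None   # node name -> attempt repair


@dataclass
class FaultManager:
    """Hypothetical coordinator: monitor, recover, log events, raise alarms."""
    rules: List[Rule]
    event_log: List[str] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)

    def check_node(self, node: str) -> None:
        for rule in self.rules:
            if rule.is_healthy(node):
                continue
            self.event_log.append(f"{node}: {rule.name} failed")
            if rule.recover is not None:
                rule.recover(node)
                # Re-run the check to confirm that the repair worked.
                if rule.is_healthy(node):
                    self.event_log.append(f"{node}: {rule.name} recovered")
                    continue
            # No repair is defined, or the repair did not help: escalate.
            self.alarms.append(f"{node}: {rule.name} needs an administrator")


# Simulated cluster state (an assumption standing in for real probes).
daemon_up = {"node01": True, "node02": False}
disk_ok = {"node01": True, "node02": False}


def restart_daemon(node: str) -> None:
    daemon_up[node] = True


manager = FaultManager(rules=[
    Rule("scheduler daemon", lambda n: daemon_up[n], restart_daemon),
    Rule("local disk", lambda n: disk_ok[n]),  # no automatic repair defined
])

for node in ("node01", "node02"):
    manager.check_node(node)

print(manager.event_log)
print(manager.alarms)
```

Keeping each fault type in its own `Rule` is one way to reflect the modular, extensible design the abstract mentions: a new fault type is added by appending a rule, without changing the monitoring loop.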
Keywords/Search Tags: High performance computer system, Failure, Diagnosis, Recovery