
Research On Availability Evaluation For Supercomputer Systems

Posted on: 2010-11-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Zheng
GTID: 1118330332478653
Subject: Computer software and theory
Abstract/Summary:
Supercomputer systems are important strategic resources that every country strives to own, and performance is their lifeblood. Today, supercomputer performance is improving by orders of magnitude and is no longer the bottleneck in supercomputer design. Availability, however, is becoming increasingly critical: availability problems keep these systems from operating normally, so unless they are handled properly, the systems can never deliver their full performance and may at times deliver far less. To improve availability and to minimize the performance impact of failure and maintenance events, availability evaluation of these systems is highly significant. However, the availability of supercomputers cannot be evaluated in the same way as that of ordinary computer systems, because supercomputers have many characteristics that ordinary systems lack. Availability evaluation for supercomputers therefore deserves further research in several directions, such as evaluation methods, measurement metrics, and the solution of evaluation models. This thesis builds on a survey of existing research on availability evaluation for supercomputer systems and on the conclusions drawn from it, including the general principle of availability evaluation for ordinary systems and its three primary elements. It presents work in three aspects, aimed at overcoming shortcomings in current research on supercomputer availability evaluation and at solving the problems that arise when the general principle and its elements are applied directly to supercomputers.
The three aspects are: (1) availability evaluation principles and methods designed specifically for supercomputers; (2) application-oriented availability evaluation metrics that reflect the essential characteristics of supercomputers; (3) solutions to the state space explosion problem that inevitably arises when the state space models of supercomputers are solved with numerical analysis methods.
The contributions of this dissertation include:
(1) An evaluation method named AOHAM (short for Application-Oriented Hierarchical Availability Modeling) is proposed for supercomputer availability evaluation. It is based on the general characteristics of supercomputer systems and takes multiple observation subjects into account. Using hierarchical, modular modeling with SANs (stochastic activity networks), AOHAM captures the relationships between system behaviors through places or activities shared among model modules. With the help of the modeling tool Mobius, the requirements of different observation subjects can be satisfied by solving the integrated model just once, which avoids much of the repeated work needed when, following the general principle used for ordinary systems, the supercomputer is modeled separately for each observer.
(2) Two new availability evaluation metrics are proposed: Powerful Availability (PA) and Available Power (AP). Their definitions and measurement rules are stated and derived in detail. Both are motivated by the observation that measuring how much computing power a supercomputer can provide to the user is far more meaningful than merely judging whether it is available at a given time. The difference between them is that the former directly measures the computing power that the system can provide, while the latter measures the ratio of this power to the system's total computing power.
Evaluating a set of simple example models with varying parameters, using both Powerful Availability and traditional primary availability, yields experimental results showing that the new metrics better reflect the essential characteristics of supercomputers and are therefore more suitable for supercomputer availability evaluation.
(3) An automatic distributed state space generation scheme based on the MapReduce mechanism has been designed and implemented. Numerical analysis is an important way of solving state space models, which are the most significant method for measuring supercomputer availability. Unfortunately, as the target system grows in scale it runs into a critical problem, the state space explosion problem, which hinders the application of state space models to supercomputer availability evaluation. One important way to counter this problem is to generate the model's state space in parallel in a distributed environment. Existing implementations of this approach have shortcomings, such as high demands on platforms and programmers and poor extensibility, so the scheme proposed in this thesis is built on the open Hadoop platform and its MapReduce mechanism, which automatically parallelizes the state space generation process. It has been realized in an ordinary distributed environment, and the experimental results show that: (a) it achieves a good speed-up ratio; (b) the scheme is independent of the host platform, which is easy to scale and can thus accommodate growth of the simulated system; (c) it is easy to use for ordinary programmers, whether or not they have knowledge of parallel programming.
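The idea of generating a state space in map and reduce rounds can be illustrated with a minimal single-process sketch. It is not the thesis's Hadoop implementation: the toy model (a tuple of components that are each up or down), the function names, and the in-memory "map" and "reduce" phases are all assumptions made for illustration. Each round, the map phase expands every frontier state independently, and the reduce phase merges duplicate successors and drops states already visited.

```python
def successors(state):
    """Hypothetical toy model: each component is 1 (up) or 0 (down).
    One transition flips exactly one component (a failure or a repair)."""
    for i in range(len(state)):
        flipped = list(state)
        flipped[i] = 1 - flipped[i]
        yield tuple(flipped)


def map_phase(frontier):
    """Map: every frontier state emits (successor, source) pairs.
    States are expanded independently, so this step could be distributed."""
    return [(nxt, src) for src in frontier for nxt in successors(src)]


def reduce_phase(pairs, visited):
    """Reduce: merge pairs that share a successor key and keep only
    successors that have not been visited in an earlier round."""
    return {nxt for nxt, _ in pairs} - visited


def generate_state_space(initial):
    """Breadth-first generation: repeat map and reduce rounds until
    no new state appears."""
    visited = {initial}
    frontier = {initial}
    while frontier:
        frontier = reduce_phase(map_phase(frontier), visited)
        visited |= frontier
    return visited


if __name__ == "__main__":
    space = generate_state_space((1, 1, 1))
    print(len(space))  # 2**3 states for three two-state components
```

In a real MapReduce job the frontier and the visited set would live in distributed storage rather than in Python sets; the sketch only shows why the expansion step parallelizes naturally.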
Therefore, this scheme has broad and promising application prospects.
(4) The availability of the two core parts of a particular supercomputer system (the host system and the peripheral system) has been evaluated, with Powerful Availability and traditional primary availability respectively. For the host system, the PA metric was adopted, the system's logical hierarchy was analyzed, and behavior models for the different levels were built independently as SANs. These model modules were then integrated by Mobius into a single model of the whole system, which needs to be solved only once to fulfill the requirements of different observation subjects. The availability of the peripheral system was evaluated with the traditional primary availability metric, since its availability is Boolean in nature. Hierarchical SAN models were also built for it, and a series of experiments was carried out on them with several kinds of parameters. The conclusion drawn from these evaluations is that the choice of metric, PA or primary availability, is determined by the nature of the target system. If its availability is Boolean, primary availability is equivalent to PA and is more convenient to measure; otherwise PA should be chosen, because only PA properly reflects the system's computing power.
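The contrast between a Boolean availability measure and a power-aware one can be made concrete with a small sketch. The abstract does not give the formal definitions of PA and AP, so the functions below use neutral, hypothetical names and simple time-weighted averages; the trace format (duration, working nodes), the equal power per node, and the rule that the system counts as "up" when at least one node works are all assumptions.

```python
def primary_availability(trace, min_nodes=1):
    """Fraction of time the system counts as 'up' (Boolean view).
    Assumption: 'up' means at least min_nodes nodes are working."""
    total_time = sum(duration for duration, _ in trace)
    up_time = sum(duration for duration, nodes in trace if nodes >= min_nodes)
    return up_time / total_time


def mean_delivered_power(trace, power_per_node):
    """Time-averaged computing power actually delivered,
    assuming every node contributes the same power."""
    total_time = sum(duration for duration, _ in trace)
    return sum(d * n * power_per_node for d, n in trace) / total_time


def power_ratio(trace, total_nodes):
    """Delivered power as a fraction of the system's total power."""
    total_time = sum(duration for duration, _ in trace)
    return sum(d * n for d, n in trace) / (total_time * total_nodes)


# Hypothetical traces: (duration in hours, number of working nodes).
degraded = [(60, 100), (30, 50), (10, 0)]   # runs at half capacity for a while
all_or_nothing = [(90, 100), (10, 0)]       # Boolean behavior: fully up or fully down

if __name__ == "__main__":
    print(primary_availability(degraded))        # 0.9
    print(power_ratio(degraded, 100))            # 0.75
    print(primary_availability(all_or_nothing))  # 0.9
    print(power_ratio(all_or_nothing, 100))      # 0.9
```

For the degraded trace the Boolean measure reports 0.9 even though only 75% of the total computing power was delivered, while for the all-or-nothing trace the two values coincide, which matches the conclusion above that a system with Boolean availability can be evaluated with primary availability alone.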
Keywords/Search Tags: supercomputer, availability evaluation, primary availability, powerful availability, model methods, state space generation, MapReduce, SANs, Mobius