
Research On Availability Evaluation And Measurement Of Complex Computer System

Posted on: 2014-08-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Feng
GTID: 1228330422492400
Subject: Computer system architecture
Abstract/Summary:
Complex computer systems are used in critical industries such as financial services, telecommunications, energy, transportation, and aviation, which bear on national economic security and social safety. Such systems require not only powerful transaction-processing capacity but also high availability, so that they can provide high-speed, continuous, and stable information processing services. Delays or failures in such systems can cause incalculable economic losses and may even have negative impacts on society. Research on availability testing of complex computer systems therefore contributes to improving their availability and helps the national economy operate smoothly and securely.

Previous research has proposed that patterns of correlation exist between hardware components or between software faults, and that these may also affect system availability. Most of that research, however, was theoretical. Discussions of correlation often lack support and are less than convincing because direct evidence of correlation in actual systems is missing. This thesis analyzes a failure record of a bank computer system and a running log of a high-end server; the analysis indicates that correlation between components may exist at both the system level and the component level. To make the system availability model more accurate, the thesis compares the failure distribution of the bank computer system with the failure distribution in the LANL fault data sets. It finds that the hardware failure time distribution of SMP-architecture-based computing systems belongs to the Weibull family.

Complex computer systems used in critical industries often adopt a k-out-of-n system architecture to meet high availability requirements, so this thesis focuses on modeling such systems with correlation factors taken into account.
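The Weibull finding above can be made concrete with a short sketch. It evaluates the two-parameter Weibull reliability function, hazard rate, and mean lifetime for a single component. The function names, the parameter names `shape` and `scale`, and the numeric values are illustrative assumptions, not figures or notation from the thesis.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability of surviving beyond time t: R(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate: h(t) = (shape/scale) * (t/scale)^(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_mean_life(shape, scale):
    """Mean time to failure: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


# Hypothetical parameters (hours); shape < 1 gives a decreasing hazard rate.
shape, scale = 0.8, 1000.0
for t in (100.0, 500.0, 1000.0):
    r = weibull_reliability(t, shape, scale)
    h = weibull_hazard(t, shape, scale)
    print(f"t={t:7.1f} h  R(t)={r:.4f}  h(t)={h:.6f}")
print(f"mean life = {weibull_mean_life(shape, scale):.1f} h")
```

With `shape` equal to 1 the hazard rate is constant and the model reduces to the exponential distribution, so a fitted shape parameter away from 1 is what separates a Weibull description from a constant-failure-rate one.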
First, the load-sharing k-out-of-n system is modeled using stochastic process theory. It is shown that the time from the (i-1)-th component failure to the i-th component failure follows a two-parameter Weibull distribution, and that the system's residence times in different states are correlated. Copula theory is introduced, and the Gumbel copula function is used to capture the right-tail correlation between the system's residence times in different states. An algorithm is proposed for calculating the component correlation matrix of a k-out-of-n system, given a specified component failure sequence.

To describe correlation between system components intuitively, this thesis discusses a system description model called DRBD, which is derived from the reliability block diagram by adding dynamic factors. The thesis introduces the advantages of DRBD and uses it to describe some common system structures, such as the series reliability model, the common-cause/common-mode fault model, the redundancy model, and the RAID structure model. A DRBD-based method and procedure for evaluating system availability are proposed, together with an approach for transforming the DRBD models of the aforementioned system structures into GSPN models and solving them.

Traditional availability testing methods follow an online-test approach, in which multiple target systems with the same configuration operate together for some period. But complex computer systems in critical industries are typically highly available, so online-tracking tests take a long time to yield accurate results. To overcome this problem, this thesis presents a system availability testing method for k-out-of-n systems based on an MTBF threshold test, which converts the system-level availability test into availability tests of the redundant components.
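The abstract does not give the thesis's actual conversion procedure, so the following is only a minimal sketch of how a system-level availability target might be turned into a component-level MTBF threshold. It assumes identical, independent components (the thesis itself models correlation, which this baseline ignores) and the steady-state relation A = MTBF / (MTBF + MTTR). All function names and numeric values are hypothetical.

```python
import math


def k_out_of_n_availability(k, n, a):
    """Probability that at least k of n independent components are up,
    each being up with probability a."""
    return sum(math.comb(n, i) * a**i * (1 - a) ** (n - i)
               for i in range(k, n + 1))


def min_component_availability(k, n, target, tol=1e-12):
    """Smallest per-component availability that meets the system target.
    Uses bisection; system availability is increasing in a."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if k_out_of_n_availability(k, n, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


def mtbf_threshold(component_availability, mttr):
    """Invert A = MTBF / (MTBF + MTTR) to get the MTBF needed (requires A < 1)."""
    return component_availability * mttr / (1.0 - component_availability)


# Hypothetical example: a 1-out-of-2 system with a "five nines" target
# and an assumed mean time to repair of 4 hours per component.
a_min = min_component_availability(k=1, n=2, target=0.99999)
print(f"required component availability: {a_min:.6f}")
print(f"component MTBF threshold: {mtbf_threshold(a_min, mttr=4.0):.1f} h")
```

The point of the sketch is that the per-component requirement is much weaker than the system-level one, which is why testing redundant components against a threshold can take far less time than tracking the whole system online.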
An availability evaluation and testing platform oriented to transaction-processing fault-tolerant computer systems is designed and implemented; it consists of a fault injection platform, an availability test suite, and an availability test database.

A simulated dual-mode application environment is built on high-end servers, following the structure of a bank business system. A series of online tests shows that the availability evaluation results match the officially announced results to within the same order of magnitude. The availability evaluation and testing platform can judge whether the target system achieves the required level of availability within a relatively short period of time.
Keywords/Search Tags: fault-tolerant computing, correlation analysis, distribution of fault, Copula function, availability evaluating and testing