
Reliability, Performance And Energy Joint Correlation Modeling And Optimization For Large-scale Complex IT Systems

Posted on: 2017-01-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Qiu
GTID: 1108330485988422
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, a new generation of information technologies (IT), typified by cloud computing and big data processing, has integrated various infrastructures for resource sharing, gradually forming a new type of IT system: large-scale complex IT systems (LSCITS). Compared with traditional IT systems, LSCITS must not only efficiently manage large-scale, heterogeneous, and complex infrastructure, but also satisfy various application requirements, in particular reliable computing, high-performance computing, and energy saving. To realize optimal scheduling and management of LSCITS from the perspectives of reliability, performance, and energy consumption, it is essential to build theoretical models that precisely analyze and evaluate these important metrics. However, most prior research treated reliability, performance, and energy consumption as separate metrics, ignoring the important correlations among them. Meanwhile, scheduling and management technologies in LSCITS also face new challenges, such as ensuring highly efficient scheduling and management of large-scale infrastructure and developing rational scheduling and management strategies for complicated multi-objective optimization. To solve these critical problems in LSCITS, this dissertation systematically studies reliability-performance-energy (R-P-E) correlation models for two typical kinds of LSCITS (i.e., cloud systems and big data processing systems), and designs a novel scheduling and management architecture based on bionic autonomic nervous systems (BANS). Optimization technologies that comprehensively consider the R-P-E correlation are also developed. The main contributions and innovations of the dissertation include:

1. A hierarchical and interacting stochastic modeling method (HISM) is proposed. For traditional service systems that are migrated to cloud systems, the corresponding R-P-E correlation model is presented based on HISM. A semi-Markov reliability model is first presented in the infrastructure layer, which captures random failures and recovery of physical machines (PMs) and virtual machines (VMs). This reliability model can be used to analyze common cause failures (CCF) of co-located VMs caused by failures of the PM, a special kind of CCF that exists only in virtualized environments. Then, a performance model based on queuing theory is proposed in the application layer, which takes the random capacity of available resources as an important input parameter. Overflow failures of the request queue and timeout failures of user requests are analyzed in detail. In the supervision layer, the energy consumption model takes into account the random change of dynamic energy consumption, which is notably affected by random failures and recovery. Finally, Markov reward models (MRM) and a Bayesian approach are adopted to evaluate the expected performance and expected energy consumption metrics, which capture the important R-P and R-E correlations, respectively. A novel metric named the performance-energy efficiency ratio (PEER) is also designed to quantify the complicated P-E tradeoff. Theoretical results are verified by comparison with simulation results. Experimental results also demonstrate that the PEER metric can effectively contribute to selecting a rational and comprehensive resource assignment strategy.

2. Following the presented HISM method, new R-P-E correlation models for private cloud service systems and public cloud service systems are further proposed. A hierarchical recovery mechanism consisting of multiple repair actions is designed to efficiently remove multiple types of failures.
Based on this flexible and prompt recovery mechanism, the corresponding Markov model is built for evaluating the reliability of private cloud service systems. For performance analysis of private cloud service systems, a new Jackson queuing network model is presented to effectively analyze the operational states of the centralized cloud scheduler, which is the most critical element of such systems. This performance model comprehensively captures the request parse time of the scheduler and the serving time of VMs. For public cloud service systems, the performance model considers a complicated characteristic of user behavior: a user request may require multiple VMs. Since not only random failure and recovery but also random resource utilization significantly affect the energy consumption metric, the energy consumption model is connected with the presented reliability and performance models by using the Bayesian approach. Finally, a simulation program is developed to verify the correctness of the presented correlation models. Numerical examples illustrate how the expected performance and energy consumption metrics vary with the decision variable that identifies a resource scheduling strategy.

3. Another new correlation modeling method based on the Laplace-Stieltjes Transform (LST) is proposed for computing-intensive tasks (CITs) of big data processing systems. Because the execution time of CITs directly affects the energy consumed in completing them, a semi-Markov model of the CIT execution process is built that takes various realistic factors into account, including a bound on the random failure time imposed by the perfect task completion time, and random failures and recovery of hardware and data processing programs. By analyzing this model, the expected completion time and energy consumption of CITs are derived using properties of the LST and Bayesian theory.
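The link between failures, completion time, and energy described for CITs can be illustrated with a small sketch. It does not reproduce the dissertation's semi-Markov/LST model. It assumes a much simpler, hypothetical setting: failures arrive as a Poisson process with rate `failure_rate`, each failure is followed by a fixed `recovery_time`, the task then restarts from scratch, and power draw is a constant `power_watts`. Under those assumptions the mean completion time has the standard closed form (e^{λT} − 1)(1/λ + R), which the simulation checks. All function and parameter names are illustrative.

```python
import math
import random


def expected_completion_time(work_time, failure_rate, recovery_time):
    """Closed-form mean completion time for restart-from-scratch execution.

    Assumes exponential times to failure (rate `failure_rate`) and a fixed
    recovery time. These are illustrative assumptions, not the source's model.
    """
    return (math.exp(failure_rate * work_time) - 1.0) * (
        1.0 / failure_rate + recovery_time
    )


def simulate_completion_time(work_time, failure_rate, recovery_time, runs, seed=0):
    """Monte Carlo estimate of the same quantity, used as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        elapsed = 0.0
        while True:
            time_to_failure = rng.expovariate(failure_rate)
            if time_to_failure >= work_time:
                # The task finishes before the next failure occurs.
                elapsed += work_time
                break
            # Work done so far is lost; add it plus the recovery time.
            elapsed += time_to_failure + recovery_time
        total += elapsed
    return total / runs


def expected_energy(work_time, failure_rate, recovery_time, power_watts):
    """Expected energy in joules under the constant-power assumption."""
    return power_watts * expected_completion_time(
        work_time, failure_rate, recovery_time
    )


if __name__ == "__main__":
    T, lam, R = 10.0, 0.01, 2.0  # hypothetical values (seconds, 1/s, seconds)
    print("closed form:", expected_completion_time(T, lam, R))
    print("simulated  :", simulate_completion_time(T, lam, R, runs=20000))
    print("energy (J) :", expected_energy(T, lam, R, power_watts=200.0))
```

In this toy setting the expected energy is simply power times expected completion time, so a higher failure rate raises both metrics together, which is one simple instance of the reliability-performance-energy coupling discussed here.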
For the other typical type of task in big data processing systems, i.e., data-intensive tasks (DITs), we systematically analyze the complicated decision behaviors for executing DITs, which are composed of task partitioning strategies (PS) and redundant execution strategies (RS). A computation algorithm is also developed for deriving the probability distribution of the random execution time of DITs in such a redundant and parallel computing environment. The expected execution time and energy consumption of DITs are then calculated using the Bayesian method. Numerical examples demonstrate that the proposed models are of theoretical value for analyzing the rationality of PS and RS.

4. Multi-objective optimization models for LSCITS are further developed based on the presented correlation models. According to the types and complexities of the decision variables, multiple approaches are presented for solving the optimization models, including direct analysis of Pareto optimal solution sets, convergent search algorithms, and genetic algorithms (GA). The major innovation in scheduling and management technologies is an autonomic and dynamic resource management capability designed in the spirit of BANS. To achieve this capability, an optimality distribution map is first established, which depicts how sensitive the optimality of a resource assignment strategy is to the arrival rate of user requests. Then, an autonomic trigger mechanism based on the map is realized for re-assigning resources to fit the dynamic fluctuation of user requests. In addition, a genetic algorithm based on the optimality distribution map is developed for searching for an optimal request scheduling strategy, which significantly improves the convergence speed of the GA.
Experiments demonstrate that the BANS-based scheduling and management mechanism not only achieves satisfactory results in optimizing the expected pure profit metric, but also notably improves the search for globally optimal solutions for dispatching user requests.
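The direct analysis of Pareto optimal solution sets mentioned for the multi-objective models can be sketched for a small finite set of candidate strategies. The strategy names and their (response time, energy) values are hypothetical, and both objectives are assumed to be minimized; the abstract does not specify the dissertation's actual objectives or decision variables.

```python
from typing import Dict, List, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if `a` is no worse than `b` everywhere and strictly better somewhere.

    All objectives are treated as quantities to minimize (an assumption).
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: Dict[str, Objectives]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, objs in candidates.items():
        dominated = any(
            dominates(other_objs, objs)
            for other_name, other_objs in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Hypothetical strategies: (mean response time in s, energy in kJ).
    strategies = {
        "A": (9.0, 120.0),
        "B": (5.0, 210.0),
        "C": (4.0, 330.0),
        "D": (4.5, 400.0),  # slower and costlier than C, so dominated
    }
    print(pareto_front(strategies))
```

This exhaustive pairwise check is quadratic in the number of candidates, so it only suits small strategy sets; larger decision spaces are where search methods such as genetic algorithms become relevant.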
Keywords/Search Tags: reliability, performance, energy consumption, correlation models, bionic autonomic nervous systems (BANS)