
Research On Reliability Modeling And Optimization In Large-Scale Cloud Computing System

Posted on: 2022-04-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Meng
GTID: 1488306524470574
Subject: Computer application technology
Abstract/Summary:
In recent years, large-scale cloud computing systems have become a support platform for big data, IoT, AI, and other applications. With the rapid increase in their scale and complexity, hardware and software failures have become common, and multiple failures need to be found and repaired in time. At the same time, the impact of increasingly complex cloud computing system architectures on reliability is becoming more prominent (for example, difficulty of evaluation and inefficient operation and maintenance) and has attracted wide attention from academia and industry. How to quantify the reliability characteristics of a complex cloud computing system and guarantee them effectively has become a critical problem restricting the sustainable development of the cloud computing industry.

In existing research, reliability is often studied as a single index, ignoring its correlation with scale, performance, quality of service, energy consumption, and other indicators. This makes it hard to accurately grasp the severe impact of reduced reliability on a large-scale cloud computing system and to design feasible reliability optimization schemes. Moreover, a large-scale cloud computing system covers a broader range of failure types, which makes the correlation analysis of reliability and other indicators (such as energy consumption) more difficult. More accurate modeling and analysis methods and more flexible optimization techniques are needed so that correlation optimization can serve as a reliability guarantee.

In practical applications, large-scale cloud computing systems need to provide multilevel services. The access of a large number of users, diversified applications, parallel computing capacity requirements, and regularly occurring multi-type failures all pose severe challenges to the system reliability, service reliability, and business reliability of cloud computing systems. Unlike traditional IT systems, large-scale cloud computing systems have distinct technical characteristics, such as logical virtualization, dynamic integration of resources, and flexible migration of applications, which give them more complex failure correlation phenomena and more flexible fault-tolerance technology. As a result, traditional reliability evaluation models and analysis technologies cannot be directly applied to a large-scale cloud computing system. Reliability optimization of such a system must therefore fully consider the critical impact of these system structures, technical characteristics, and functional indicators on reliability evaluation and optimization.

The dissertation takes the large-scale cloud computing system as its research object and the reliability modeling and optimization of that system as its research content. The research focuses on three topics: reliability association modeling and optimization based on cost constraints, reliability modeling and optimization oriented to cloud services, and reliability modeling of big data operations based on fault-tolerance technology. These correspond to the system, service, and application perspectives, respectively. The research covers the system structure, fault types, fault-tolerance technology, and application characteristics of the large-scale cloud computing system, and aims to comprehensively evaluate its reliability correlation characteristics and optimize them continuously. The main research work consists of the following parts.

(1) To model the correlation between reliability and energy consumption, the dissertation proposes a set of optimization strategies for reliability guarantee and energy efficiency improvement based on cost constraints. The strategies use fault tree modeling and backup optimization technology, consider the common cause failure of physical nodes and virtual machines, and optimize reliability and energy efficiency jointly. First, based on the reliability association model, the dissertation proposes a design method for cost constraints and then a cost-constrained reliability assurance framework, which integrates the fault tree analysis method with hot spare and cold spare policies for virtual machines. Second, it proposes a resource scheduling algorithm for the joint optimization of reliability and energy, which can dynamically guarantee the reliability of the whole system through virtual machine migrations. Finally, by integrating the Google Cluster Trace benchmark into the experimental environment, the cost constraint is used to find the optimized number of physical nodes in the cloud infrastructure and further optimize the energy efficiency cost.

(2) For the emerging cloud service mode with multiple users and multiple service types, the dissertation divides the cloud service process into a request processing stage and a request execution stage. In the request processing stage, queuing theory is used to analyze request timeout faults and request overflow faults. In the request execution stage, the k-out-of-n system modeling method is used to analyze the system service process under hot spare. Based on these reliability analyses, the dissertation further realizes dynamic optimization scheduling in the request processing stage and fault self-repair in the request execution stage. Dynamic optimization scheduling triggers the scheduling mechanism autonomously based on changes in the request arrival rate; the triggered scheduling behavior keeps service reliability above the specified level while avoiding unnecessary waste of resources. Fault self-repair monitors the running state of in-service virtual machines, autonomously performs anomaly detection, triggers fast repair when possible faults are found, and improves through self-learning according to the change in reliability after each repair, so that the effect of repair on reliability improves continuously. As a result, the large-scale cloud computing system can effectively maintain its service reliability in a complex and dynamic service environment.

(3) For the essential application scenario of big data processing, reliability modeling for different types of big data jobs and fault-tolerance technologies is systematically studied. First, for big data jobs with periodic tasks, an executability model using a checkpoint fault-tolerance mechanism is proposed. The model takes a big data job as a unit, analyzes in detail the random hardware failures, software failures, and recovery actions during job execution, and uses a Markov stochastic process, the Laplace-Stieltjes transform, and other mathematical methods to evaluate executability quantitatively. Second, for big data jobs with parallel computing requirements, the fault-tolerance technology of redundant execution is adopted. Because the complex topology of real-time redundant parallel computing is difficult to analyze, a general method is proposed that divides the whole execution tree into multiple minimum job spanning trees and analyzes them. The method addresses the complication that the elements of multiple minimum job spanning trees overlap. Finally, the reliability evaluation of the whole big data job is obtained by combining Bayesian theory and the inclusion-exclusion principle.
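Two of the combinatorial tools named in parts (2) and (3) can be illustrated with a minimal Python sketch. It is not the dissertation's model, which the abstract does not specify. It assumes that units fail independently, that the virtual machines in the k-out-of-n case share one success probability, and that each minimum job spanning tree can be treated as a set of elements that must all work. The function names and the example element names are hypothetical, and the Bayesian step mentioned in part (3) is not covered.

```python
from itertools import combinations
from math import comb, prod


def k_out_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent, identical units work.

    Assumption: every unit (e.g. a hot-spare VM) works with probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def union_reliability(path_sets, p):
    """Probability that at least one path set has all of its elements working.

    path_sets: list of sets of element names; sets may share elements.
    p: dict mapping element name -> independent success probability.
    Applies the inclusion-exclusion principle over every non-empty
    group of path sets, so shared elements are counted only once.
    """
    total = 0.0
    for r in range(1, len(path_sets) + 1):
        sign = 1 if r % 2 == 1 else -1
        for group in combinations(path_sets, r):
            # All path sets in the group work iff every element in their union works.
            elements = set().union(*group)
            total += sign * prod(p[e] for e in elements)
    return total


if __name__ == "__main__":
    # Service needs 2 of 3 VMs, each available with probability 0.9 (illustrative numbers).
    print(k_out_of_n_reliability(2, 3, 0.9))

    # Two hypothetical minimal trees that overlap in element "b".
    trees = [{"a", "b"}, {"b", "c"}]
    probs = {"a": 0.9, "b": 0.9, "c": 0.9}
    print(union_reliability(trees, probs))
```

The inclusion-exclusion sum grows exponentially with the number of path sets, so this direct form is only practical for a small number of overlapping trees.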
Keywords/Search Tags:Large-scale cloud computing system, reliability, modeling and optimization, fault tolerant, Big Data