
Research on Strategic Management of Fault Diagnosis and Identification in Cloud Services Infrastructure

Posted on: 2019-07-14
Degree: Doctor
Type: Dissertation
Country: China
GTID: 1368330590972838
Subject: Computer Science and Technology
Abstract/Summary:
The infrastructures of large cloud services experience frequent faults, which are major contributors to total management costs and lead to Service Level Agreement (SLA) violations for the hosted services. In recent years, cloud services infrastructure has grown at an unprecedented rate among key cloud providers, including IBM, Amazon, and Google. The characteristics that make cloud computing services attractive include an apparently infinite pool of available resources, flexible economies of scale, multi-tenancy, and self-organization, which differentiate such services from traditional distributed systems (e.g., data centers and grids). Despite these benefits, cloud computing services also pose challenges. Such complex systems have become a popular computing paradigm because they allow workloads to scale automatically in response to changes in demand and allow resources to be virtualized. This elasticity is achieved by continually reconfiguring virtual resources and reallocating physical workloads, which increases the probability of faults and anomalies, especially in the provision of infrastructure as a service (IaaS). The management of fault diagnosis and identification within cloud service infrastructures is therefore critical, and it forms the main focus of this dissertation.

This dissertation addresses the following four specific topics in the strategic management of fault diagnosis and identification in cloud services infrastructure:

(1) Fault diagnosis and identification have attracted extensive attention because of their importance in the fault management framework for cloud infrastructure, even though fault diagnosis becomes more difficult as scale and complexity increase in heterogeneous, virtualized environments. Most fault diagnosis and identification methods are based on active probing techniques, which can detect faults rapidly and precisely. However, most of these methods suffer from traffic overhead and from limitations in diagnosing faults, which reduce the system performance of cloud services such as IaaS.

(2) Monitoring is a particular challenge because of the massive amounts of data involved. Monitoring large complex systems requires high accuracy, low latency, and near-real-time analysis to detect faults and anomalies. Optimization is also required, with corrections validated by running representative large-scale dataset processing applications.

(3) The diagnosis and self-healing of anomalies/faults are important operations for cloud services infrastructure. Automated fault detection and real-time self-healing are required.

(4) In IaaS, four measurement criteria determine the efficacy of troubleshooting: priority, fault probability, risk, and the duration of the configured action. Some research groups aim to determine how to monitor collections of metrics, develop classifiers, and analyze the attributions of metrics, rather than individual metric thresholds, by extending fault diagnosis into troubleshooting.

To address these topics, effective methodologies are proposed, and the underlying motivations and solutions are explored. Exhaustive evaluations were conducted through a comprehensive empirical analysis and new quantitative approaches, and an infrastructure was established for future research. Four separate but inter-related achievements were realized:

(1) First, we propose and develop a new hybrid model, named accelerated fault diagnosis and identification (AFDI), to monitor various system metrics for VMs and their physical host servers based on the severity of fault levels and anomalies, and to investigate fine-grained fault-tolerance algorithms. Based on these findings, we propose a new methodology for constructing a model that optimizes the performance of real-time monitoring and improves prediction accuracy on the Hadoop MapReduce and Apache Spark platforms.

(2) Next, we propose a new method that diagnoses faults/anomalies by analyzing and classifying them based on qualitative metrics. The distributions of the faults/anomalies, which are determined by machine learning algorithms, are used instead of individual metric thresholds to create time-series diagnostic methods that detect and classify anomalies/faults at runtime. This makes it possible to estimate the influence of each self-healing system component on system functionality and to realize high availability of services.

(3) We propose a new theoretical approach for constructing the steps of a fault detection and remediation (troubleshooting) model. The approach combines a Naive Bayes Classifier (NBC) with a multi-valued decision diagram (MDD) and an influence diagram (ID) to structure and manage fault troubleshooting for cloud anomaly detection. Its practical purpose is to provide a decision-theoretic methodology for modeling the construction steps of fault troubleshooting in cloud services infrastructure.

(4) Finally, we propose an Apache Spark-based performance troubleshooting framework for bottlenecks in IaaS, which we name CloudPT. CloudPT has several advantages: it offers high-efficiency detection; it has a unified, all-around feedback loop that collaborates with the management of cloud ecosystems; and it includes a troubleshooting performance test. The objectives of CloudPT are to monitor collections of metrics, develop analyses, and classify the attributes of measurements, as opposed to individual metric thresholds, by extending fault troubleshooting.
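
The contrast the abstract draws between analyzing the distribution of a metric and checking individual metric thresholds can be illustrated with a small sketch. This is not the dissertation's algorithm: the function names, the z-score test, the sample CPU values, and the limits below are all assumptions chosen only to show the idea.

```python
from statistics import mean, stdev

def threshold_is_anomalous(window, limit):
    """Per-metric threshold check: flag if any sample exceeds a fixed limit."""
    return any(value > limit for value in window)

def window_is_anomalous(baseline, window, z_limit=3.0):
    """Distribution check (hypothetical): flag a window whose mean is far
    from the baseline mean, measured in standard errors."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Degenerate baseline: any change in the mean counts as a shift.
        return mean(window) != mu
    # Standard error of a window mean under the baseline distribution.
    standard_error = sigma / len(window) ** 0.5
    z_score = abs(mean(window) - mu) / standard_error
    return z_score > z_limit

# Illustrative CPU utilization samples (percent); values are made up.
baseline = [50, 52, 48, 51, 49, 50, 53, 47, 50, 50]
shifted = [58, 59, 57, 60, 58]   # sustained shift, still far below 90%
normal = [50, 51, 49, 50, 52]

threshold_hit = threshold_is_anomalous(shifted, limit=90)
shift_detected = window_is_anomalous(baseline, shifted)
normal_detected = window_is_anomalous(baseline, normal)
```

In this sketch the sustained shift never crosses the fixed 90% limit, so the threshold check stays silent, while the distribution check flags it and leaves the normal window alone.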
Keywords/Search Tags: Cloud services infrastructure, Fault diagnosis, Self-healing, Big datasets, Apache Spark, Performance testing