
Online Performance Anomaly Prediction and Prevention for Complex Distributed Systems

Posted on: 2013-03-30
Degree: Ph.D
Type: Dissertation
University: North Carolina State University
Candidate: Tan, Yongmin
Full Text: PDF
GTID: 1458390008965557
Subject: Computer Science
Abstract/Summary:
Real-world distributed systems (e.g., cloud computing infrastructures, enterprise data centers, massive data processing systems) have become increasingly complex as they grow in both scale and functionality. However, such complexity makes these systems vulnerable to performance anomalies caused by various faults such as resource contention, performance bottlenecks, software bugs, and hardware failures. It is a daunting task for system administrators to manually track the execution status of many distributed hosts at all times in order to search for anomaly root causes and correct performance anomalies. Therefore, it is imperative to develop automated system anomaly management schemes that achieve robust distributed systems with minimal human intervention.

This dissertation focuses on exploring the key techniques for building robust distributed systems. It includes three studies on system performance anomaly prediction and one study on scalable and resilient system monitoring. The following are the key contributions of this dissertation:

First, we present a set of online anomaly prediction models that aim to raise advance alerts prior to anomaly occurrences, providing a window of opportunity (i.e., lead time) for predictive anomaly prevention and alleviation. We propose integrated prediction models that combine attribute value prediction (a Markov chain model) with statistical classification methods (a naive Bayesian classifier and tree-augmented Bayesian networks). We also propose a decision-tree-based prediction model that introduces an additional alert state, alongside the normal and anomaly states, to achieve advance anomaly prediction. We further present comprehensive measurement studies to quantify the predictability of different real-world system performance anomalies. We observe that those real system anomalies do exhibit predictability and that our anomaly prediction models can achieve high prediction accuracy with generous lead time and low prediction overhead.

Second, applications running under dynamic execution contexts (e.g., changing input workload) may exhibit context-dependent behaviors that can cause a monolithic prediction model to make wrong predictions. We address this problem by adding context-aware and self-evolving features to the online prediction models. We use a hierarchical clustering algorithm to discover different system runtime execution contexts. We then characterize normal and abnormal system behaviors under these different contexts by building an ensemble of prediction models trained from conflict-free data. At runtime, we predict the current context based on the context evolution patterns and employ the prediction model for that context to achieve high prediction accuracy.

Third, a complete solution for predictive system performance anomaly management requires not only accurate anomaly prediction models but also anomaly prevention to steer the system away from the potential abnormal state. To this end, we present a novel predictive performance anomaly prevention system that integrates online anomaly prediction and virtualization-based prevention techniques. Our system can raise advance anomaly alerts and perform coarse-grained anomaly cause inference to pinpoint the faulty application components and infer the most related system metrics. Based on those prediction results, our system uses virtualization techniques to perform virtual machine perturbations (e.g., elastic resource scaling, live virtual machine migration) to prevent the impending performance anomalies.

Fourth, we develop an image-based, resilient, self-compressive monitoring system for large-scale hosting infrastructures. We model snapshots of the monitored distributed system as a sequence of system images and apply lightweight online reference block search algorithms to compress the distributed monitoring data. Our compressive monitoring system is also failure resilient: it can tolerate the host and network failures that are common in real-world hosting infrastructures.
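
As a rough illustration of the integrated prediction idea in the first contribution, the Python sketch below pairs a first-order Markov chain per metric (to predict future attribute values) with a naive Bayesian classifier (to label the predicted values as normal or anomalous). It is a minimal sketch and not the dissertation's implementation: the two metrics (CPU and memory percentages), the bin width, the synthetic trace, the labeling rule, and every class and function name are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

BIN_WIDTH = 20  # assumed bin size for turning percentages into discrete states


def discretize(row):
    """Map raw metric values to discrete states (assumed equal-width bins)."""
    return tuple(int(v // BIN_WIDTH) for v in row)


class MarkovValuePredictor:
    """First-order Markov chain over the discrete states of one metric."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def fit(self, states):
        for cur, nxt in zip(states, states[1:]):
            self.transitions[cur][nxt] += 1

    def predict(self, state, steps=1):
        """Follow the most frequent transition `steps` times."""
        for _ in range(steps):
            successors = self.transitions.get(state)
            if not successors:
                break  # unseen state: assume the value persists
            state = successors.most_common(1)[0][0]
        return state


class NaiveBayes:
    """Naive Bayesian classifier over discrete features, Laplace-smoothed."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, index) -> value counts
        self.values = defaultdict(set)              # index -> values seen in training

    def fit(self, rows, labels):
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for i, v in enumerate(row):
                self.feature_counts[(label, i)][v] += 1
                self.values[i].add(v)

    def predict(self, row):
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for label, n in self.class_counts.items():
            lp = math.log(n / total)
            for i, v in enumerate(row):
                k = len(self.values[i]) + 1  # one extra slot for unseen values
                lp += math.log((self.feature_counts[(label, i)][v] + 1) / (n + k))
            if lp > best_lp:
                best, best_lp = label, lp
        return best


def predict_anomaly(current_state, predictors, classifier, steps):
    """Predict each metric `steps` ahead, then classify the predicted state."""
    future = tuple(p.predict(s, steps) for p, s in zip(predictors, current_state))
    return classifier.predict(future)


# Synthetic trace of (cpu %, mem %) samples; purely illustrative data.
cycle = [(15, 30), (35, 45), (55, 65), (75, 85), (95, 90)]
samples = cycle * 4
labels = ["anomaly" if cpu >= 75 else "normal" for cpu, _ in samples]
states = [discretize(r) for r in samples]

predictors = [MarkovValuePredictor() for _ in range(2)]
for i, p in enumerate(predictors):
    p.fit([s[i] for s in states])

clf = NaiveBayes()
clf.fit(states, labels)

# Two sampling intervals of lead time from a currently healthy state.
print(predict_anomaly(discretize((35, 45)), predictors, clf, steps=2))
```

In this toy setup the `steps` argument plays the role of the lead time: the classifier is applied to the predicted future state rather than the current one, so an alert can be raised while the observed metrics still look normal.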
Keywords/Search Tags:System, Prediction, Anomaly, Distributed, Prevention, Online, Infrastructures, Data