
Unsupervised Performance Anomaly Management for Production Cloud Environments

Posted on: 2016-02-24
Degree: Ph.D
Type: Dissertation
University: North Carolina State University
Candidate: Dean, Daniel Joseph
Full Text: PDF
GTID: 1478390017477106
Subject: Computer Science
Abstract/Summary:
Ensuring satisfactory application performance in multi-tenant cloud environments is a challenging task. Despite extensive testing, many performance problems are missed during development and carried over into production. These problems can come from a variety of sources, such as environmental factors, interference from other co-located applications, and software bugs. When a problem manifests in the production environment, there can be major financial penalties for all parties involved.

Due to the complex and distributed nature of production systems, these performance problems are inevitable. Instead of reacting to a production-run problem when it occurs, we propose a framework designed to proactively manage these issues.

First, we present UBL, a tool for black-box performance anomaly prediction in cloud environments. UBL predicts when a performance anomaly will occur by monitoring black-box metrics such as CPU usage, memory usage, and network usage with an unsupervised artificial neural network called the self-organizing map (SOM). Our results show that UBL predicts anomalies with higher accuracy than alternative approaches such as principal component analysis. Additionally, UBL is lightweight, imparting negligible overhead on the tested systems.

Second, we present PerfCompass, a tool for online fine-grained fault localization using system call traces. When a system experiences a performance problem, as predicted by UBL or a similar tool, PerfCompass is triggered to trace the system calls of the failing application. We use this trace to localize the problem as either an internal software bug or an external environmental cause using a novel two-phase differentiation scheme. The first phase examines the percentage of threads experiencing a significant increase in execution time or frequency, while the second phase considers how long it takes each thread to be affected by the problem.
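The two-phase idea above can be sketched as follows. This is a minimal illustration, not PerfCompass's actual implementation: the trace record format, the slowdown and onset-spread thresholds, and the 90% impact cutoff are all hypothetical assumptions chosen for the example.

```python
# Illustrative sketch of a two-phase differentiation scheme that
# separates external (environmental) causes from internal software
# bugs. Record fields and all thresholds are hypothetical, not taken
# from PerfCompass itself.

def classify_fault(threads, slowdown_threshold=2.0, onset_spread_limit=1.0):
    """threads: list of dicts with per-thread 'slowdown' (factor of
    increase in system call execution time) and 'onset' (seconds until
    the thread was first affected)."""
    # Phase 1: what fraction of threads show a significant slowdown?
    affected = [t for t in threads if t["slowdown"] >= slowdown_threshold]
    impact_ratio = len(affected) / len(threads)

    # Phase 2: how tightly clustered are the onset times of the
    # affected threads? An external cause (e.g., interference from a
    # co-located application) tends to hit most threads almost
    # simultaneously, while a bug often affects threads gradually.
    onsets = [t["onset"] for t in affected]
    onset_spread = max(onsets) - min(onsets) if onsets else float("inf")

    if impact_ratio > 0.9 and onset_spread <= onset_spread_limit:
        return "external"  # global, near-simultaneous impact
    return "internal"      # partial or staggered impact

# Example: an interference-like fault hitting all threads at once
trace = [{"slowdown": 3.0, "onset": 0.1},
         {"slowdown": 2.8, "onset": 0.3},
         {"slowdown": 3.2, "onset": 0.2}]
print(classify_fault(trace))  # -> external
```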
Our results indicate that PerfCompass correctly localizes all 24 problems we tested while imparting an average of 2.1% runtime overhead on the server.

Third, we describe PerfScope, a fine-grained root cause inference tool. When a problem is localized to an internal software bug, PerfScope provides developers with a ranked list of suspicious functions for inspection. PerfScope combines clustering with an unsupervised data mining technique called frequent episode mining over large system call traces. Our results show that PerfScope is effective, providing developers with a short list of candidate cause-related functions to examine for 12 real software bugs while imparting an average of 1.8% runtime overhead on the tested server applications.

Finally, we present HSR, a hybrid static-runtime analysis tool. HSR combines rule-based static analysis with runtime diagnosis hints (i.e., cause-related functions identified by PerfScope) to reduce the number of non-bug-related functions (i.e., false positives) developers need to examine. Our results indicate HSR reduces the number of false positives by up to 98% compared to static approaches and up to 91% compared to runtime approaches while still covering root-cause-related functions.
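The core of the hybrid filtering step can be sketched as a set intersection: keep only the functions that both the static rules and the runtime hints implicate. This is a simplified illustration of the idea, not HSR's implementation; the function names and candidate sets below are invented for the example.

```python
# Minimal sketch of hybrid static-runtime filtering: a (typically
# large) list of functions flagged by rule-based static analysis is
# pruned using runtime diagnosis hints, discarding statically flagged
# functions that never showed runtime evidence. All names here are
# hypothetical examples.

def filter_candidates(static_hits, runtime_hints):
    """Keep only statically flagged functions that the runtime
    diagnosis also implicates, preserving static-analysis order."""
    hints = set(runtime_hints)
    return [fn for fn in static_hits if fn in hints]

static_hits = ["parse_request", "alloc_buffer", "log_write",
               "compress_chunk", "update_stats"]
runtime_hints = ["compress_chunk", "alloc_buffer"]

print(filter_candidates(static_hits, runtime_hints))
# -> ['alloc_buffer', 'compress_chunk']
```

Intersecting the two sources is what drives the false-positive reduction: a static rule that fires on many benign functions is silenced unless runtime evidence corroborates it.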
Keywords/Search Tags: Performance, Related functions, Cloud, Problem, Production, UBL, Unsupervised, Runtime