
Anomaly Detection And Diagnosis For Metrics In Cloud Services

Posted on: 2022-08-18  Degree: Doctor  Type: Dissertation
Country: China  Candidate: M H Ma  Full Text: PDF
GTID: 1488306746956679  Subject: Computer Science and Technology
Abstract/Summary:
As cloud services grow, ensuring their reliability is vital to a good user experience. Each component of a cloud service is therefore continuously monitored, and metrics are collected from it. Anomaly detection is applied to these metrics, treated as either univariate or multivariate time series. However, operators frequently make software changes to cloud services, which can significantly shift the data distribution of the metrics, a phenomenon known as concept drift. Previous anomaly detection approaches cannot handle concept drift and lose performance as a result. In addition, once an anomaly is detected, root cause diagnosis must be performed over a large number of metrics. Diagnosing root causes requires domain-specific knowledge, and the results must be interpretable so that operators can act on them. For example, database administrators care about intermittent slow queries in cloud databases. To tackle the challenges of software changes and interpretability, this dissertation proposes three systems for anomaly detection and root cause diagnosis in real-world cloud services. The main contributions are summarized as follows:

(1) Concept drift adaptation framework for anomaly detection: To the best of our knowledge, this dissertation is the first to identify the problem of robustly and rapidly adapting anomaly detectors to concept drift. We design and implement a framework called StepWise, which contains a concept drift detection method that requires no parameter tuning and a robust linear model for concept drift adaptation. We have deployed a prototype of StepWise in the cloud services of Sogou Search. Our evaluation shows that StepWise improves the average accuracy of many widely used anomaly detectors by 206% over a baseline without any concept drift adaptation. The adaptation lag is only about six minutes.

(2) Jump-starting multivariate time series anomaly detection: We conduct an empirical study that identifies the long initialization time of deep-learning-based multivariate time series anomaly detection approaches, which leaves them unable to cope with the concept drift caused by software changes. We propose JumpStarter, which incorporates compressed sensing into multivariate time series anomaly detection for the first time. We also design two major components of JumpStarter: shape-based clustering and outlier-resistant sampling. On real-world datasets from the cloud services of Tencent and ByteDance, JumpStarter achieves an accuracy of 94.1% with a short initialization time of twenty minutes.

(3) Root cause diagnosis framework for intermittent slow queries: To the best of our knowledge, this dissertation is the first to identify the problem of intermittent slow queries in cloud databases. Drawing on domain knowledge, we design an interpretable diagnosis framework called iSQUAD, which contains four core components: anomaly extraction, dependency cleansing, type-oriented pattern integration clustering, and a Bayesian case model. We have deployed a prototype of iSQUAD in the Alibaba cloud database service. iSQUAD helps database administrators diagnose the root causes of intermittent slow queries with an accuracy of 80.4% and a short diagnosis time.
Keywords/Search Tags: Anomaly Detection, Root Cause Diagnosis, Concept Drift, Cloud Database