| The microservice architecture provides a scalable, agile and efficient development paradigm for large-scale software development, and stability and reliability are its core goals. The large scale, complex dynamic relationships and multi-level structure of microservices make them difficult to operate and maintain. This paper focuses on fault detection and root cause location in microservice systems, which allow operations and maintenance personnel to detect and locate faults quickly and accurately. Failure detection in microservices must cope with frequent iterations, large amounts of data and complex service relationships. Existing methods often rely on too much monitoring and tracing data, which increases the overhead of anomaly detection and location. AutoEncoder solutions based on recurrent neural networks are still not effective in training and inference, especially when long-sequence reconstruction is involved, and they do not take into account the influence of container-to-container relationships. To solve these problems, this paper proposes a transformer-based reconstruction model, named Transformer-based Anomaly Detector (TAD), which models temporal features and dynamically captures container relationships using multi-head attention mechanisms and sandwich structures. TAD uses readily available container performance metrics, making it easy to deploy in container clusters that are already running. Evaluations are conducted on a sock-shop dataset collected from a real microservice environment and on the publicly available SMD dataset; anomaly detection performance, anomaly detection latency and anomalous container location are all improved. Further analysis shows that the method is effective and useful for detecting and locating anomalous containers. This paper divides microservices into three layers: the service layer, the resource layer and the metric layer. Root cause location faces the challenge of spanning multiple levels to locate the metric. This study presents a multi-level root cause location method, which ultimately locates potential causes using tracing and metric data. In addition, a differential call delay is proposed to decompose the tracing chain, and a breadth-first-search-based potential failure discovery algorithm is used to locate potential failure areas from the service layer to the resource layer. Combining the three characteristics of a root cause, a multifactorial root cause score is proposed, and the root cause metric is located by ranking the metrics in the potential failure area. This method achieves the best results on the SockShop dataset and the AIOPS 2020 challenge dataset, which were collected in real microservice environments. |