| With the rise of microservices and increasingly mature container technology, more and more Web applications have evolved from a monolithic architecture to a complex microservice architecture. Today’s microservice systems usually consist of tens to thousands of microservices with complex call relationships, which makes the application system difficult to maintain. A failure in a microservice system can cause huge losses, mainly economic losses and loss of user satisfaction. Therefore, when an anomaly occurs in a microservice system, its source must be located quickly. However, efficient and accurate anomaly detection is very difficult because of the large number of underlying services, the complex invocation relationships between services, and the difficulty of obtaining data sets. In recent years, AIOps (intelligent operation and maintenance), which is based on machine learning, has emerged to make fault localization more efficient. However, existing methods have shortcomings. In anomaly detection, methods based on the call chain often face dimension explosion of the call chain vector, which directly affects the accuracy of the model, and methods that infer the calling relationships between services from the correlation of KPI indicators between abnormal services may lead to inaccurate root cause localization. In root cause localization, most existing methods rely on an improved PageRank algorithm or a spectrum method and use only the call relationships or only statistical methods to infer the root cause, which is often inaccurate. This thesis attempts to achieve more accurate anomaly detection and root cause localization by combining multi-source monitoring data with new anomaly detection and root cause localization methods. For anomaly detection, based on multi-source
data, we combine anomaly detection on the service invocation chain with anomaly detection on the service itself. For call chain anomaly detection, we propose a more effective way to construct the call chain vector, which avoids the dimension explosion problem to a certain extent. Service-level anomaly detection has two parts. The first is anomaly detection on the KPI time series of each service, for which we propose GRU-VAE, a multi-dimensional time-series anomaly detection model that improves on the VAE model. The second is a spectrum method, and the two together produce the anomaly score of each service. Our root cause localization method improves the PageRank algorithm. The innovation is that previous PageRank-based methods did not dynamically consider the weight of nodes and the walk probability between nodes, so they perform poorly, especially when there are multiple root causes. We combine the service-level anomaly score with the call-chain-level anomaly score to initialize the node weights (the teleport vector) and the transition probabilities between nodes in the service call graph. As a result, the improved PageRank algorithm takes into account not only the abnormal call relationships between services but also the degree of abnormality of each service and the abnormal walk probability between services, which yields more accurate root cause localization. Finally, we use a simple explanatory power formula to localize the root cause at the Pod level. Experimental results show that our anomaly detection and root cause localization methods cover a broader range of issues and improve accuracy and interpretability compared with other methods. |
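
The PageRank variant described above can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `rank_root_causes`, the service names, the scores, the damping factor, and the choice to walk from caller to callee are all assumptions made for illustration. It only shows how service-level anomaly scores could seed the node weights and how call-chain-level anomaly scores could set the transition probabilities.

```python
def rank_root_causes(service_score, edge_score, damping=0.85, iters=100):
    """Personalized PageRank over a service call graph (illustrative sketch).

    service_score: {service: anomaly score >= 0}, used as node weights.
    edge_score: {(caller, callee): anomaly score >= 0}, used as walk weights.
    """
    nodes = sorted(service_score)
    total = sum(service_score.values())
    if total <= 0:
        raise ValueError("at least one service needs a positive anomaly score")
    # Node weights: normalized service-level anomaly scores.
    teleport = {n: service_score[n] / total for n in nodes}

    # Outgoing weight per caller, used to row-normalize edge scores
    # into transition probabilities.
    out_total = {n: 0.0 for n in nodes}
    for (src, _dst), w in edge_score.items():
        out_total[src] += w

    rank = dict(teleport)
    for _ in range(iters):
        # Rank held by nodes with no outgoing weight is redistributed
        # according to the node weights, so the ranks keep summing to 1.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0.0)
        nxt = {n: (1 - damping + damping * dangling) * teleport[n] for n in nodes}
        for (src, dst), w in edge_score.items():
            if out_total[src] > 0.0:
                nxt[dst] += damping * rank[src] * w / out_total[src]
        rank = nxt
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical services and scores, chosen only to exercise the function.
service_score = {"frontend": 0.2, "cart": 0.3, "db": 0.9, "search": 0.1}
edge_score = {
    ("frontend", "cart"): 0.8,
    ("frontend", "search"): 0.1,
    ("cart", "db"): 0.9,
}
ranking = rank_root_causes(service_score, edge_score)
for service, score in ranking:
    print(f"{service}: {score:.3f}")
```

In this sketch a service ranks highly either because its own anomaly score is high or because anomalous calls lead to it, which is the combination the abstract describes. How the thesis actually normalizes and combines the two scores is not stated here.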