| With the rapid development of cloud technology, microservices, with their low coupling and high scalability, are gradually becoming the mainstream architecture for IT systems. Meanwhile, as Docker and related technologies have matured, many large enterprises have adopted container cloud platforms as the infrastructure on which their systems run. As the microservices running in containers become more diverse and their structures more complex, a local exception in one service can propagate and trigger a large number of concurrent alerts. To keep a microservice system operating normally, it is necessary to detect anomalies across many microservices and to quickly locate the root cause among the concurrently abnormal services. To locate the root cause services quickly and accurately, root cause analysis techniques are studied and a root cause localization algorithm is designed for containerized microservice scenarios. The algorithm models the deployment relationships between hosts and services and the invocation relationships between services, and constructs an anomaly propagation graph to simulate how anomalies propagate among services. It detects anomalies in running services by collecting system and service metrics, correlates microservice anomalies with their resource utilization using a random walk model, ranks the services that may have caused the current anomaly, and outputs a root cause list without requiring expert knowledge. An intelligent operation and maintenance system is also designed and implemented: the resource management module configures and displays information about the system's host and service resources, the unified monitoring module collects and monitors multidimensional indicators of hosts and services online, and the operation and maintenance management module performs anomaly detection and root cause localization for services. Simulation experiments show that the proposed root cause localization algorithm maintains the average accuracy of its output while keeping running time short. The intelligent operation and maintenance system collects monitoring indicators of hosts and microservices in real time and performs online anomaly detection and root cause localization, helping to ensure stable system operation. |
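
The ranking step described in the abstract can be illustrated with a minimal sketch of a random walk with restart over a small propagation graph. The abstract does not specify the graph format, the transition weights, or the restart probability, so everything here is an assumption made for illustration: the function name `rank_root_causes`, the `edges` and `anomaly_score` inputs, the service names, and the scores are hypothetical and are not the authors' implementation.

```python
def rank_root_causes(edges, anomaly_score, restart=0.15, iterations=200):
    """Rank services by visit probability of a random walk with restart.

    edges: dict mapping a service to the services it calls (assumed format).
    anomaly_score: dict mapping a service to a non-negative anomaly score,
        e.g. derived from how strongly its metrics deviate (assumed input).
    """
    nodes = sorted(
        set(edges) | {v for vs in edges.values() for v in vs} | set(anomaly_score)
    )
    total = sum(anomaly_score.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one service needs a positive anomaly score")

    # Restart distribution: the walker restarts at a service with
    # probability proportional to that service's anomaly score.
    prefer = {n: anomaly_score.get(n, 0.0) / total for n in nodes}

    # Transition probabilities: from each service the walker moves to one of
    # its callees or stays in place, weighted by the target's anomaly score.
    trans = {}
    for u in nodes:
        weights = {}
        for v in list(edges.get(u, [])) + [u]:
            w = anomaly_score.get(v, 0.0)
            if w > 0:
                weights[v] = weights.get(v, 0.0) + w
        s = sum(weights.values())
        trans[u] = {v: w / s for v, w in weights.items()} if s > 0 else {}

    prob = dict(prefer)
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            mass = prob[u]
            # With probability `restart`, or always at a dead end, restart.
            jump = restart if trans[u] else 1.0
            for n in nodes:
                nxt[n] += jump * mass * prefer[n]
            for v, p in trans[u].items():
                nxt[v] += (1.0 - jump) * mass * p
        prob = nxt

    ranking = sorted(nodes, key=lambda n: prob[n], reverse=True)
    return ranking, prob


# Hypothetical call graph: frontend calls cart and catalog, cart calls db.
edges = {"frontend": ["cart", "catalog"], "cart": ["db"]}
anomaly_score = {"frontend": 0.6, "cart": 0.8, "db": 0.9, "catalog": 0.0}

ranking, prob = rank_root_causes(edges, anomaly_score)
for service in ranking:
    print(f"{service}: {prob[service]:.3f}")
```

In this toy graph the walker drifts from anomalous callers toward anomalous callees, so the service deepest in the anomalous call chain (`db`) collects the most probability mass and heads the list. A fuller model would also need the host-to-service deployment edges and resource utilization metrics that the abstract mentions.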