| With the development of the Internet, securities companies have widely adopted distributed systems to support financial services that require high concurrency, low latency, and high reliability. However, failures in complex distributed systems are inevitable; they arise from network communication problems, software bugs, performance problems, and so on. Once a fault occurs, it often degrades the service and can even make it entirely unavailable, which in turn causes economic losses. Moreover, because of the complexity and uncertainty of the system, it is difficult for engineers to locate the root cause of a fault immediately. Fault avoidance and system operations are therefore very important. In this context, Gartner proposed intelligent operations and maintenance (AIOps) to improve management efficiency. AIOps directly or indirectly enhances IT operations functions through big data, modern machine learning, and other analysis techniques. Within AIOps, anomaly detection and fault root cause localization based on the various kinds of monitoring data are essential problems. Against this background, we introduce a novel two-layer architecture for failure prediction based on high-dimensional monitoring sequences. Specifically, the first layer generates anomaly scores from the high-dimensional monitoring data of a given component using EXPoSE, a real-time unsupervised anomaly detector. Based on these anomaly scores, the second layer employs a random forest, one of the most successful ensemble classification methods, to predict whether a node will fail within a given time interval. In our experiments, this method predicts faults three hours in advance, but it cannot identify their root cause. We therefore also explore how to locate the root cause of a fault based on call traces and machine metrics, and propose a three-step locator. First, it performs online anomaly detection on business indicators to judge whether a fault has occurred. Then it computes a change score for the time distribution of call traces and the degree of change in machine metrics. Finally, these change features are fed into a random forest model to predict the root cause. Experiments show that the method is accurate and fast. In summary, this paper takes a data-driven approach to two practical problems in AIOps, improving the efficiency of engineers' operations work and supporting intelligent operations management in enterprises. |