| In intelligent operation and maintenance, fault discovery and fault diagnosis play a central role. Fault discovery is the detection of a fault by the operation and maintenance system after the fault occurs during system operation, and fault diagnosis is the process of analyzing and locating its root cause. The two scenarios are upstream and downstream of each other: fault discovery hands the detected fault to fault diagnosis for root cause analysis and localization. With the emergence of massive operation and maintenance data, manual operation and maintenance falls far short of current needs. At the same time, as enterprise system architectures are updated and iterated rapidly and grow more complex, customized operation and maintenance approaches are no longer applicable. In addition, operation and maintenance labels are scarce. Therefore, how to use machine learning and big data technology effectively to move from manual to intelligent operation and maintenance is a focus of current research. To improve the accuracy and efficiency of fault discovery and fault diagnosis, this paper studies a multivariate time series anomaly detection method and an anomaly pattern clustering method to support these two scenarios. Firstly, for fault discovery, this paper studies anomaly detection on multivariate time series and proposes a Prediction-Augmented Auto Encoder (PAAE). Because the hidden space lacks effective constraints during the reconstruction of multivariate time series, a hidden space prediction module is designed to enhance reconstruction ability by predicting the hidden space variables. At the same time, skip connections are used to preserve global and local information in the encoding process, so as to improve anomaly detection performance. Finally, experiments and analysis are carried out on
four widely used multivariate time series datasets, MSL, SMAP, SMD and SWaT, which demonstrate the strong performance of PAAE. Ablation experiments and parameter sensitivity experiments are carried out to verify the robustness and stability of PAAE. Secondly, for fault diagnosis, this paper studies root cause analysis of anomalies and proposes a Framework of Clustering Abnormal Patterns (F-CAP). In real operation and maintenance scenarios, based on the idea that the same kind of abnormal event produces similar abnormal patterns, cluster analysis is performed on abnormal events so that operation and maintenance personnel need to handle each cluster only once, which effectively reduces the labor cost of operation and maintenance. At the same time, multi-source operation and maintenance data such as log, trace and metric data are introduced to broaden the model's field of observation and significantly improve performance. In addition, to cope with the fast iteration of system architectures, incrementally learned knowledge bases of log templates, traces and root cause clusters are constructed. Finally, on the MicroSS dataset, the effectiveness of F-CAP is verified through comparison with benchmark methods and ablation experiments, and a case study illustrates the applicability of F-CAP and its ability to reduce the cost of operation and maintenance. |
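
The idea that similar abnormal events share a pattern, and that the cluster knowledge base grows incrementally, can be illustrated with a minimal sketch. This is not F-CAP itself: the class `PatternKnowledgeBase`, the cosine-similarity measure, the running-mean centroid update and the `0.9` threshold are all assumptions chosen for illustration, and each abnormal event is assumed to be already encoded as a fixed-length feature vector.

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class PatternKnowledgeBase:
    """Hypothetical incremental store of abnormal-pattern clusters."""
    threshold: float = 0.9  # assumed similarity cut-off for joining a cluster
    centroids: list = field(default_factory=list)
    counts: list = field(default_factory=list)

    def assign(self, pattern):
        """Return the cluster id for `pattern`, creating a new cluster if needed."""
        best_id, best_sim = -1, -1.0
        for cid, centroid in enumerate(self.centroids):
            sim = cosine_similarity(pattern, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id >= 0 and best_sim >= self.threshold:
            # Known pattern: fold the event into the centroid as a running mean.
            n = self.counts[best_id]
            self.centroids[best_id] = [
                (c * n + p) / (n + 1)
                for c, p in zip(self.centroids[best_id], pattern)
            ]
            self.counts[best_id] = n + 1
            return best_id
        # Unseen pattern: open a new cluster so the knowledge base grows.
        self.centroids.append(list(pattern))
        self.counts.append(1)
        return len(self.centroids) - 1


# Toy feature vectors standing in for encoded abnormal events (made-up values).
kb = PatternKnowledgeBase(threshold=0.9)
events = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
labels = [kb.assign(e) for e in events]
print(labels)  # [0, 0, 1, 1]
```

In this toy run the four events fall into two clusters, so an operator would handle two cases instead of four. A new kind of event simply opens a new cluster without retraining on earlier data.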