| As the cost of cloud servers falls, their large capacity and strong performance have led more and more enterprises and organizations to deploy applications on them. However, attacks against cloud servers have become more frequent and varied, and traditional defenses such as rules written by human experts struggle to cope. Machine learning algorithms can adapt to anomalous behavior patterns across different data types and can discover anomalies automatically by analyzing large amounts of data, which gives them a clear advantage over traditional methods. Data pre-processing (in this paper, dimensionality reduction in particular) is generally performed before an anomaly detection algorithm classifies the data, in order to improve training efficiency; however, traditional dimensionality reduction methods tend to discard useful feature dimensions, which leads to poor classification results. The anomaly detection algorithms most widely used in industry for cloud environments include clustering, support vector machines, and isolation forest. However, these algorithms generalize poorly and classify inefficiently on the large-scale data found in cloud environments; in addition, they operate on a single feature space, which limits their classification performance. To address these problems, this paper first proposes a dimensionality reduction strategy that combines five feature selection methods with machine learning algorithms; compared with traditional dimensionality reduction methods, it removes useless dimensions while retaining as much useful information as possible. Second, after dimensionality reduction, the anomaly detection algorithm adopted in this paper is based on ensemble learning, but ensemble learning has problems of its own, including low classification efficiency and poor classification caused by data 
imbalance (especially in log datasets generated in cloud environments). Therefore, building on the Extreme Boosting Based Outlier Detection (XGBOD) ensemble learning algorithm, this paper addresses the data imbalance of the anomaly detection log dataset generated in the cloud environment, generates a new data space, and optimizes the data distribution while improving classification efficiency and accuracy; the effectiveness is validated by evaluating relevant metrics. The contributions of this paper are as follows: (1) To address the problems of traditional dimensionality reduction methods, we combine machine learning with five feature selection methods to design a dimensionality reduction strategy that evaluates and filters feature dimensions, removing as much redundant information as possible while keeping useful feature dimensions. In experiments on the Java Virtual Machine (JVM) dataset used in this paper, each dimensionality reduction method was applied to the dataset and then validated by training the Isolation Forest (IForest) algorithm: the accuracy is 83.8% with traditional Principal Component Analysis (PCA), 79.6% with Linear Discriminant Analysis (LDA), and 87.8% with this paper’s dimensionality reduction method, which is higher than the other two. (2) To improve the unsupervised feature engineering part of XGBOD, we add different types of machine learning algorithms to it so that its output becomes deeply heterogeneous, increasing the heterogeneity and richness of the data distribution space constructed by the original algorithm. The experiment is conducted on the JVM dataset from the cloud environment after applying the dimensionality reduction method of this paper. It is experimentally verified that the classification 
accuracy of the improved XGBOD is 94.9% and its F1-Score is 94.9%, both higher than the accuracy (93.8%) and F1-Score (93.8%) of the original XGBOD algorithm. We conclude that making the unsupervised feature engineering output deeply heterogeneous enriches the feature space and thereby further improves classification metrics such as accuracy. (3) Even with a richer feature space, the algorithm still has efficiency problems. Building on the deep heterogenization (isomerization) experiments, this paper proposes Lightweight and Deep Isomerization Extreme Boosting Based Outlier Detection (LDI-XGBOD), which replaces the original predictive classification algorithm of XGBOD with a lightweight gradient boosting tree algorithm. Experiments show that, on the same dataset, the training time of LDI-XGBOD falls from 1343 minutes to 965 minutes compared with the XGBOD variant that applies only deep heterogenization. Classification metrics improve along with efficiency: accuracy rises from 94.9% to 96.2%, F1-Score from 94.9% to 96.2%, and AUC from 0.83 to 0.87. The widely used K-Means and IForest algorithms reach accuracies of 88.9% and 87.8%, respectively, both lower than that of LDI-XGBOD. These experiments show that the dimensionality reduction method proposed in this paper outperforms traditional dimensionality reduction methods and effectively retains useful information, and that the proposed LDI-XGBOD algorithm is practical: in classification accuracy and other metrics it is better not only than XGBOD but also than other widely used anomaly detection algorithms. |
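
The idea behind contribution (2), enriching the feature space with the outputs of different kinds of unsupervised detectors before supervised classification, can be illustrated with a minimal sketch. This is not the paper's implementation: the two detectors (`knn_score`, `zscore_score`), the helper `augment`, and the toy data are all hypothetical stand-ins chosen only because they are of different types (distance-based and statistical). The abstract does not say which detectors the authors actually add.

```python
import math
import statistics


def knn_score(X, k=2):
    """Distance-based detector (assumed): distance to the k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(X):
        dists = sorted(math.dist(p, q) for j, q in enumerate(X) if j != i)
        scores.append(dists[k - 1])
    return scores


def zscore_score(X):
    """Statistical detector (assumed): largest absolute per-feature z-score."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [max(abs(v - m) / s for v, m, s in zip(p, means, stds)) for p in X]


def augment(X, detectors):
    """Append one outlier-score column per detector to the original features."""
    score_cols = [det(X) for det in detectors]
    return [list(p) + [col[i] for col in score_cols] for i, p in enumerate(X)]


# Toy data (hypothetical): four clustered points and one obvious outlier.
X = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (8.0, 9.0)]
X_aug = augment(X, [knn_score, zscore_score])
```

Each row of `X_aug` holds the original features followed by one score per detector. In an XGBOD-style pipeline, a supervised gradient boosting classifier would then be trained on this augmented space; in LDI-XGBOD, per the abstract, that classifier is a lightweight gradient boosting tree.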