Study On Collective Anomaly Detection And Optimization Based On Statistical Distance

Posted on: 2022-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: J E Wu
Full Text: PDF
GTID: 2518306785475994
Subject: Computer Software and Application of Computer
Abstract/Summary:
Collective anomaly detection refers to methods for determining whether collective data, that is, data gathered together in some form, is anomalous. For example, the transaction data generated by online stores are gathered together per unit of time, and these collective data should be analyzed when studying whether a merchant behaves abnormally. Effective analysis and use of collective data can uncover potential abnormal behaviors in real activities, which is of great significance for preventing and stopping them. Most anomaly detection algorithms choose Euclidean distance as the similarity measure, but statistical distance better reflects the characteristics of different collective data because it captures differences in their data distributions. In this thesis, statistical distance is used as the similarity measure between collective data, and a collective anomaly detection framework is built on the k-Nearest Neighbor algorithm. To further improve detection performance, optimization research was carried out from the perspectives of timeliness and accuracy:

(1) To address the timeliness problem of the collective anomaly detection framework based on the k-Nearest Neighbor algorithm, a KD tree structure combined with the BBF (Best Bin First) search algorithm is proposed to optimize the search space. A KD tree is a data structure based on a binary tree. For a given data set to be tested, the root node is first determined by the maximum-variance method, and the remaining data are then divided into left and right subtrees as in a binary tree. This changes the original storage structure of the data and prepares it for the BBF algorithm, which quickly finds the k nearest neighbors of the collective data to be tested. BBF is a search algorithm based on backtracking. It first establishes a priority queue and then, each time, takes the collective data with the highest priority from the queue for backtracking. The search ends when the priority queue is empty or the maximum number of backtracking steps is reached. This search avoids most irrelevant backtracking and can effectively improve query efficiency. Combining the KD tree structure with the BBF algorithm, the k nearest neighbors of the collective data can be found quickly.

(2) To address the common problems of false positives and missed detections in anomaly detection algorithms, a model based on the Reverse k-Nearest Neighbor algorithm is proposed to improve search accuracy. In anomaly detection, the false alarm rate and the missed alarm rate are both important indicators of algorithm performance, and low values of both reflect superior performance. There are two reasons for low detection performance: the first is the inherent limitation of the detection algorithm itself, and the second is mutual interference between outliers. To improve performance, the Reverse k-Nearest Neighbor model is studied as a way to reduce the interference between outliers and thereby further improve accuracy. The model is applied after the k-Nearest Neighbor-based collective anomaly detection framework has run: the anomalies it identifies are reverse-filtered with the Reverse k-Nearest Neighbor algorithm, which reduces the false positives and missed detections caused by such interference.

Experiments on real transaction data sets generated by online trading stores show that the proposed statistical-distance-based collective anomaly detection method can detect abnormal behaviors efficiently and accurately. Compared with the unimproved algorithm, the time-improvement ratio of the KD-tree-based k-Nearest Neighbor optimization increases twofold. Reverse k-Nearest Neighbor filtering reduces the false positive rate by 1% while the comprehensive evaluation index, the F1 value, is as high as 96%. In addition, experiments on multiple data sets show that the proposed algorithm is also well suited to detecting anomalies in other fields.
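As a rough illustration of the pipeline described in the abstract, the following Python sketch scores each collection by its distance to its k-th nearest neighbor and then applies a reverse-neighbor filter. It is not the thesis's implementation. The abstract does not say which statistical distance is used, so the two-sample Kolmogorov-Smirnov statistic is assumed here; the function names, the thresholds `score_threshold` and `max_reverse`, the filtering rule (keep a candidate only if few other collections list it among their k nearest neighbors), and the sample data are all hypothetical. The neighbor search is brute force; the KD tree and BBF speed-up is not shown.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (assumed statistical distance):
    the largest gap between the empirical CDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


def knn_lists(collections, k):
    """Brute-force k nearest neighbors of every collection,
    returned as sorted (distance, index) pairs."""
    n = len(collections)
    result = []
    for i in range(n):
        dists = sorted(
            (ks_distance(collections[i], collections[j]), j)
            for j in range(n)
            if j != i
        )
        result.append(dists[:k])
    return result


def detect(collections, k, score_threshold, max_reverse):
    """Flag collections whose k-th neighbor distance exceeds score_threshold,
    then keep only those with at most max_reverse reverse neighbors
    (a hypothetical filtering rule, not taken from the thesis)."""
    neighbours = knn_lists(collections, k)
    scores = [nb[-1][0] for nb in neighbours]
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]

    # reverse[i] counts how many other collections have i among their k nearest.
    reverse = [0] * len(collections)
    for nb in neighbours:
        for _, j in nb:
            reverse[j] += 1
    return [i for i in candidates if reverse[i] <= max_reverse]


# Made-up transaction amounts: four similar collections and one unusual one.
normal = [[10, 12, 11, 13], [11, 12, 10, 14], [10, 13, 12, 11], [12, 11, 13, 10]]
odd = [50, 55, 60, 52]
collections = normal + [odd]

flagged = detect(collections, k=2, score_threshold=0.5, max_reverse=1)
print(flagged)  # [4]
```

In this toy data the last collection is far from every other collection in distribution and no other collection counts it among its nearest neighbors, so it survives both the score threshold and the reverse filter.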
Keywords/Search Tags: statistical distance, k-Nearest Neighbor (kNN), KD tree, BBF algorithm, Reverse k-Nearest Neighbor (RkNN), collective anomaly detection