
Study on Outlier Detection Algorithms and Optimization for Multi-dimensional and Multi-source Data

Posted on: 2020-04-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Shou
GTID: 1488305975498444
Subject: Information and Communication Engineering
Abstract/Summary:
The purpose of anomaly detection is to quickly and accurately identify, within a complex data environment, the data that do not conform to normal behavior patterns. Within the field of data mining theory and algorithms, this dissertation studies three aspects of anomaly and outlier detection for multi-dimensional and multi-source data: outlier detection itself, privacy-preserving outlier detection, and data stream outlier detection. On the one hand, it studies methods and strategies for outlier detection in big data; on the other hand, it studies optimization methods and parallel processing algorithms for anomaly detection, to provide technical support for users who need to analyze and understand their data in depth. The research comprises the following five parts.

First, to address anomalies and outliers in datasets, the k-distance, reachability distance, local density, and multi-dimensional attribute clustering methods are studied. A measure of clustering dissimilarity is constructed to describe how dissimilar data objects are within a cluster and between clusters. A new anomaly judgment factor is then constructed by combining this measure with local density; it effectively prevents two nearby data objects that belong to the same cluster from both being selected as cluster centers, which would force one cluster to be split into two. On this basis, an anomaly detection algorithm based on multi-dimensional attribute clustering and local density is proposed. Experiments on UCI datasets show that the algorithm has significant advantages in detection rate and accuracy.

Second, to address local anomaly detection across a large number of similar datasets, dataset summarization and the generation of anomaly reference data are studied, together with parallel optimization of the summarization parameters. A method for automatically optimizing local anomaly detection parameters and processing large datasets in parallel is proposed. The practicality of the algorithm is demonstrated by experimental analysis of the detection accuracy over a large number of similar datasets, the summarization efficiency for different dataset sizes, the summarization efficiency for different numbers of anomaly seed candidate sets, and the execution time complexity of the summarization algorithm.

Third, to address privacy protection in anomaly detection, privacy-preserving anomaly detection algorithms are proposed from three perspectives: data perturbation, data encryption, and anonymity. (1) From the perspective of data perturbation, a complex factor for data perturbation is constructed from the local density and the clustering dissimilarity, and a privacy-preserving anomaly detection algorithm based on complex-transform data perturbation is proposed. (2) From the perspective of homomorphic data encryption, vertical partitioning of distributed multi-source datasets is used to obtain local and global distance matrices, perturbation matrices are added to protect the participants' data privacy, and a domain-connected privacy-preserving anomaly detection algorithm is proposed. (3) From the perspective of anonymity, the shared region of anomaly reference data in the dataset summary is studied, the shared anomaly reference data are anonymized, and an anonymous outline method for local anomaly detection in large datasets is proposed. Experimental analysis of the degree of privacy protection, the accuracy of anomaly detection, and the detection rate demonstrates the validity of the algorithms.

Fourth, to address anomaly detection in general-dimensional data streams, a data stream anomaly detection model based on a sliding window and multiple verification is studied. (1) Factors such as the angle-based local density, the cluster center factor, the k-neighborhood distance, and the local increment are studied, and an enhanced angle anomaly factor is constructed. Combined with an anomaly decision criterion based on the mean and standard deviation, an anomaly detection algorithm based on the enhanced angle anomaly factor is proposed. (2) Methods for computing the mean vector dot product and the local vector dot product density are studied. Combined with an anomaly measure for data streams, and in order to reduce the influence of manual intervention, an anomaly decision criterion based on the maximum slope is proposed, along with a data stream anomaly detection algorithm based on local vector dot product density. In comparison experiments, the algorithms show significant advantages in detection accuracy and misjudgment rate.

Fifth, to address anomaly detection in high-dimensional data streams, the sparsity of high-dimensional data streams is studied. Based on data sparsity and a data partitioning model, the L-neighbor, hypercube density, direct reachability of hypercube density, and density connection of the data stream are studied, and an overlapping accumulation value is proposed to reduce the error detection rate of the anomaly detection algorithm. A high-dimensional data stream anomaly detection algorithm based on hypercube density is proposed. Comparative experimental analysis of detection accuracy, misjudgment rate, ROC curves, and AUC shows that the algorithm has significant advantages and improves the dynamic detection of abnormal features in high-dimensional data streams.
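As background for the first part of the abstract: the k-distance, the reachable (reachability) distance, and the local density are the building blocks of the classical Local Outlier Factor (LOF). The sketch below shows plain LOF only, as a point of reference; it is not the dissertation's anomaly judgment factor, which additionally uses clustering dissimilarity. The function name `lof_scores`, the parameter `k`, and the sample points are illustrative assumptions.

```python
import math


def lof_scores(points, k):
    """Classical Local Outlier Factor for each point (pure-Python sketch).

    Assumption: Euclidean distance; `points` is a list of equal-length tuples.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []      # distance from each point to its k-th nearest neighbour
    neighbours = []  # indices of points within that k-distance (ties included)
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neighbours.append([j for d, j in others if d <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_dist[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        mean_reach = sum(reach) / len(reach)
        # Guard against division by zero when many points coincide.
        lrd.append(1.0 / max(mean_reach, 1e-12))

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] / lrd[i] for j in neighbours[i]) / len(neighbours[i])
        for i in range(n)
    ]


# Four points on a unit square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, k=2)
# The isolated point (10, 10) receives the largest score;
# the square's corners score 1, i.e. they look like inliers.
```

Scores close to 1 indicate a point whose local density matches that of its neighbours, while scores well above 1 indicate a point in a much sparser region. The pairwise distance matrix makes this sketch quadratic in the number of points, so it is only suitable for small examples.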
Keywords/Search Tags: outlier detection, complex transformation, enhanced angle, vector dot product, hypercube density