A Study On Outlier Detection Algorithms For High Dimensional Data

Posted on: 2021-01-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X D Xu
GTID: 1368330614469673
Subject: Control Science and Engineering
Abstract:
Outlier detection aims to identify anomalous data, that is, objects in a given data set that behave significantly differently from, or even opposite to, normal data. Because of its wide range of potential applications, including network intrusion detection, medical health, credit fraud detection, information retrieval, video surveillance, and public safety, outlier detection has become an active research topic in data mining. A variety of outlier detection methods have been proposed. However, two challenges remain when conventional methods are applied to high-dimensional data, which is pervasive in practice. First, high dimensionality induces the curse of dimensionality, which not only makes outlier detection difficult but also increases the time complexity of the learning methods. Second, distance concentration becomes more severe in high-dimensional spaces: the Euclidean distances between data objects tend to become nearly equal, so the objects appear almost equidistant from one another and cannot be easily distinguished. How to obtain neighborhood information effectively and efficiently in a high-dimensional space therefore remains an open problem for outlier detection.

This dissertation addresses these problems from two aspects: adopting new spatial mappings and obtaining neighborhood relationships effectively. Specifically, it proposes three methods for identifying outliers in high-dimensional data: a hash-mapping-based method, a method based on object representation and PageRank, and a sparse-representation-based method. Experimental analysis shows that the proposed methods improve the performance of outlier detection on high-dimensional data. The main research contents are summarized as follows:

1. To address the difficulty of identifying outliers in the full-dimensional space, a new hash mapping method based on locality-sensitive hashing (LSH) is proposed. It projects the high-dimensional data into a low-dimensional space via LSH and searches for outliers in the new space using a graph clustering technique. The experimental results show that the proposed method captures rich data correlation information in the low-dimensional space, thereby reducing complexity and improving the accuracy of outlier detection.

2. To address the difficulty of obtaining neighborhood information in high-dimensional spaces caused by distance concentration, an outlier detection method based on object representation and importance ranking is proposed. It first obtains the linear representation coefficients of each object in terms of the other objects, then builds a relation matrix from these coefficients, and finally detects outliers with an improved PageRank method. The advantage of the method is that it obtains stable neighborhood relationships without any distance calculation and can therefore achieve high detection precision. Experimental results on multiple real data sets verify its effectiveness.

3. To address the inability of neighborhood-based and similarity-based methods to select neighbors automatically in high-dimensional data, a novel outlier detection strategy based on sparse representation is proposed. It first projects the high-dimensional data into a low-dimensional space via a sparse operation and finds representative neighbors with a self-representation learning technique, and then offers two techniques, random walk and spectral clustering, to detect outliers. The proposed method obtains neighborhoods automatically, without parameter setting. Comparative experiments against different algorithms on multiple real data sets show that the proposed methods greatly improve the accuracy and stability of outlier detection.
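
The distance concentration effect that motivates the second and third methods can be illustrated numerically. The following is a minimal sketch, not taken from the dissertation: it assumes points drawn uniformly at random from the unit hypercube, and the function name `relative_contrast`, the sample size, and the chosen dimensions are all illustrative. It measures how much the farthest distance from a query point exceeds the nearest one, relative to the nearest one.

```python
import numpy as np


def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim.

    Illustrative helper (an assumption of this sketch, not part of the
    dissertation). A small value means the nearest and farthest points
    are almost equally far away, i.e. distances have concentrated.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)              # a single query point
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# Print the contrast for increasing dimensionality.
for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(500, dim):.3f}")
```

Under these assumptions the printed contrast shrinks as the dimension grows, which is why rankings built directly on Euclidean nearest neighbors become unreliable and why the abstract turns to representation-based neighborhoods instead.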
Keywords: high-dimensional data, outlier detection, sparse learning, spectral clustering, locality-sensitive hashing, random walk