
Research Of Outlier Detection Algorithm Based On Sparse Coding And Neighborhood Entropy In High-dimensional Space

Posted on: 2021-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhou
GTID: 2518306107483654
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, as data mining has attracted more and more attention, mining potential and valuable information from massive data has gradually become an important and challenging task. Moreover, the outliers in a dataset usually contain more information than ordinary samples. Therefore, mining outliers, which behave inconsistently with most of the data, has become an important branch of data mining, and the methods are widely used in various fields, such as intrusion detection, natural disaster prediction, and credit fraud detection. However, in the current era of information explosion, the problem for researchers is not only the huge amount of data but also its complexity. With the rapid growth of data dimensionality, it becomes difficult to obtain satisfactory results with traditional outlier detection methods based on Euclidean distance. Therefore, this thesis proposes a Sparse coding and Neighborhood entropy based Outlier Detection algorithm (SNOD) to overcome the failure of the traditional Euclidean distance metric in high-dimensional space. The main research contents include the following aspects:

Firstly, this thesis studies some existing high-dimensional outlier detection methods and traditional Euclidean distance based detection methods, including subspace-based methods, the Isolation Forest algorithm (IF), and the Local Outlier Factor algorithm (LOF). It also explores the performance differences of these algorithms on datasets with varying dimensions and summarizes the patterns of outliers that can be detected, especially when the dimensionality surges.

By observing the sparse representation of samples, this thesis finds that the sparse representations of ordinary samples and outliers differ significantly in their use of some atoms in the sparse dictionary. In addition, a unique dictionary is built based on samples to make the process of representation calculation more efficient. At the same time, the method adaptively constructs the neighborhood for the samples and focuses on the local area, which helps to detect more outliers.

Combining the idea of LOF with the assumption that samples closer to a given sample will get higher coefficients when solving the sparse representation problem, this thesis proposes the concept of neighborhood entropy. According to the influence of samples on the overall information entropy in the neighborhood, the degree of abnormality is observed to calculate the outlier score of samples.

Finally, this thesis conducts extensive experiments on 11 high-dimensional datasets to demonstrate the effectiveness of the method. Experiments on these benchmark datasets and comparison with state-of-the-art methods validate the advantages of the algorithm.
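
The abstract does not state the formula for neighborhood entropy, so the following is only a minimal sketch of the general idea under stated assumptions: the sparse coefficients of a sample over its neighborhood are treated as non-negative weights, normalized into a distribution, and measured with Shannon entropy. The function names `shannon_entropy` and `entropy_influence`, the leave-one-out scoring rule, and the example coefficient vectors are all hypothetical and are not taken from the thesis.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (in nats) of the absolute weights, normalized to sum to 1."""
    magnitudes = [abs(w) for w in weights]
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # no mass at all: treat as zero entropy
    probs = [m / total for m in magnitudes if m > 0]
    return -sum(p * math.log(p) for p in probs)


def entropy_influence(weights, index):
    """Entropy change when the weight at `index` is removed.

    Assumption: this leave-one-out difference is one plausible way to express
    "the influence of a sample on the overall entropy"; it is not the SNOD score.
    """
    without = weights[:index] + weights[index + 1:]
    return shannon_entropy(weights) - shannon_entropy(without)


# Hypothetical coefficient vectors over a 4-sample neighborhood.
evenly_spread = [0.25, 0.25, 0.25, 0.25]
concentrated = [0.97, 0.01, 0.01, 0.01]

print(shannon_entropy(evenly_spread))
print(shannon_entropy(concentrated))
print(entropy_influence(evenly_spread, 0))
```

In this sketch, a coefficient vector concentrated on a single neighbor yields lower entropy than one spread evenly across the neighborhood. How SNOD maps such quantities to an outlier score is specified in the full thesis, not here.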
Keywords/Search Tags: Outlier Detection, Local Outlier, High-dimensional, Sparse Coding