Research And Application Of Outlier Detection Algorithm Based On Subspace

Posted on: 2021-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: J C Feng
GTID: 2518306095475774
Subject: Computer technology
Abstract/Summary:
Outlier detection is an important research topic in data mining and is widely used in fields such as credit card fraud detection, fault detection, and medical diagnosis. To detect outliers effectively in high-dimensional data, researchers have proposed subspace-based outlier detection algorithms. Isolation forest (IForest) and histogram-based outlier detection (HBOS) are two typical subspace algorithms that are widely used in practice, but they still suffer from low accuracy, poor stability, and low efficiency. Starting from the perspective of outlier feature subspace sampling, this thesis studies the problems of the two algorithms separately, proposes corresponding improvement strategies, and applies the improved algorithms to the analysis of celestial spectral data. The main research contents are as follows:

(1) IForest is highly random when building isolation trees and inefficient when combining them. To address these problems, this thesis proposes a fast outlier detection algorithm based on isolation forest. The algorithm first selects the samples for each isolation tree through a heuristic method, and chooses specific cutting points during tree construction to insert data into the corresponding leaf nodes, which reduces the impact of random selection on the algorithm's performance. Second, several isolation trees are combined into an isolation forest, and the outlier degree of each isolated leaf node is calculated. Finally, the data objects with the largest outlier degrees are selected as the final outliers. Theoretical analysis and experimental results show that the improved algorithm effectively improves the efficiency of the isolation forest algorithm.

(2) The HBOS algorithm is relatively efficient on high-dimensional and massive data, but it assumes that the dimensions are independent of each other, which may lead to unstable detection results and insensitivity to local outliers. To address these problems, an improved histogram algorithm is proposed. The algorithm first uses extreme gradient boosting trees (XGBoost) to search for outlier-related subspaces in the data set, and selects features according to their outlier-related importance to construct outlier histograms. Second, density estimation is performed for each sample and the outlier scores are calculated. Finally, the data objects with the highest outlier scores are marked as outliers. Experiments show that the improved algorithm is significantly more accurate and stable than HBOS.

(3) Building on the above research, an outlier analysis of LAMOST spectral data is carried out, and a prototype system for astronomical spectrum outlier detection is designed and implemented. The thesis describes the modules and functions of the prototype system. The results show that the prototype system provides a new way to explore outliers and special objects.
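For readers unfamiliar with the baseline that part (2) improves on, the following is a minimal sketch of plain HBOS scoring, not the improved algorithm of the thesis. It builds one equal-width histogram per feature, treats the features as independent, and sums the log of the inverse normalized bin height for each sample. The function name `hbos_scores`, the parameter `n_bins`, and the toy data are assumptions made for illustration only.

```python
import math


def hbos_scores(rows, n_bins=10):
    """Baseline HBOS sketch: higher score means more outlying.

    rows is a list of equal-length numeric feature vectors.
    """
    n_features = len(rows[0])
    scores = [0.0] * len(rows)
    for j in range(n_features):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        # A constant feature would give width 0; fall back to one usable bin.
        width = (hi - lo) / n_bins or 1.0
        counts = [0] * n_bins
        bin_index = []
        for value in column:
            # Clamp so the maximum value falls into the last bin.
            k = min(int((value - lo) / width), n_bins - 1)
            bin_index.append(k)
            counts[k] += 1
        tallest = max(counts)
        for i, k in enumerate(bin_index):
            # Bin heights are normalized so the tallest bin has height 1;
            # each feature adds log(1 / height), which is 0 for that bin.
            scores[i] += math.log(tallest / counts[k])
    return scores


# Toy data (hypothetical): five clustered points and one distant point.
data = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.0], [1.2, 2.2], [8.0, 9.0]]
scores = hbos_scores(data, n_bins=5)
most_outlying = max(range(len(data)), key=scores.__getitem__)
```

Because the per-feature contributions are simply added, a sample that is unusual only in a combination of features can go unnoticed, which is the independence limitation the abstract points out. Restricting the loop to a subset of features chosen by some importance ranking would be one way to connect this sketch to the feature-selection idea in (2), though the thesis's exact procedure is not given here.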
Keywords/Search Tags: Outlier detection, Subspace, Isolation forest, Histogram, Feature ranking, Astronomical spectrum