Analysis And Research Of Outlier Detection Algorithm For High Dimensional Data

Posted on: 2020-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Chen
GTID: 2428330590972688
Subject: Software engineering
Abstract/Summary:
Outlier detection is an important branch of data mining. Because it both filters noise out of a data set and uncovers potentially meaningful information in it, outlier detection has far-reaching practical significance and broad application prospects. With the rapid development of information technology and networks, high-dimensional big data applications are now everywhere. When such data is analyzed in the full-dimensional space, it becomes sparse, and the true outliers are masked by the noise contributed by the many dimensions. Traditional outlier detection methods therefore struggle to detect outliers in high-dimensional data, and they are also inefficient. As a result, searching for the subspaces relevant to outliers and mining outliers from high-dimensional data have become a research focus for high-dimensional outlier detection algorithms. This thesis analyzes and discusses existing outlier detection methods, introduces the common solutions for high-dimensional data, and proposes two outlier detection algorithms for high-dimensional big data. The main work is as follows.

First, this thesis proposes a high-dimensional outlier detection algorithm based on a random hash split forest, which combines locality-sensitive hashing with a tree structure. Locality-sensitive hashing maps similar data instances into the same bucket, so the degree of anomaly of an instance can be measured by the number of instances in the bucket where it lands. As in isolation forest, the algorithm uses a tree structure to partition the data set. At each partitioning step, one attribute is selected at random as the partition attribute, the locality-sensitive hash function is applied to that attribute, and data points that are similar on the attribute are assigned to the same region. The path from the root node of the tree to a leaf node thus defines a subspace. To improve robustness, the algorithm also uses random sampling. Finally, the algorithm combines the anomaly scores that a data instance receives in the different subspaces. Experimental results show that the algorithm is efficient for outlier detection in high-dimensional data.

Second, this thesis proposes a relevant subspace selection algorithm based on a sequential ensemble. Mislabeled data is one kind of outlier, and removing such outliers can effectively improve classifier performance. In high-dimensional big data scenarios, however, the curse of dimensionality makes these outliers difficult to mine. Traditional approaches to high-dimensional problems often treat dimensionality reduction and mislabeled-noise detection as two separate processes, which does not guarantee that the resulting subspace is useful for computing anomaly scores. This thesis instead uses a lasso sparse regression model, with the noise score as the target attribute of the regression and the original features as the predictors. The method integrates the two steps of subspace selection and noise mining into a sequential ensemble. In the sequential ensemble learner, subspace selection and noise mining reinforce each other, and the selected subspaces are tied to the noise scores. Experimental results show that the algorithm can effectively detect mislabeled data in high-dimensional data sets.

In summary, traditional outlier detection algorithms are ineffective and inefficient on high-dimensional data. To address these problems, this thesis proposes two novel outlier detection algorithms, and experimental results demonstrate the effectiveness of the proposed methods.
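
The first algorithm can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the hash family, the scoring formula, or any parameter values, so everything below is an assumption made for illustration. The hash is a one-dimensional quantizer with a random offset, `floor((x + offset) / width)`, in the style of p-stable locality-sensitive hashing. The per-tree score `1 / (1 + bucket_size)`, the function names, the defaults (`n_trees`, `sample_size`, `max_depth`, `branching`), and the synthetic data are all hypothetical.

```python
import math
import random


def build_tree(points, widths, depth, max_depth, rng):
    """Partition points by hashing one randomly chosen attribute per level."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    attr = rng.randrange(len(widths))           # random partition attribute
    offset = rng.uniform(0.0, widths[attr])     # random shift of the bucket grid
    buckets = {}
    for p in points:
        # Assumed 1-D LSH: nearby values on this attribute share a bucket key.
        key = math.floor((p[attr] + offset) / widths[attr])
        buckets.setdefault(key, []).append(p)
    children = {k: build_tree(v, widths, depth + 1, max_depth, rng)
                for k, v in buckets.items()}
    return {"size": len(points), "attr": attr, "offset": offset,
            "children": children}


def bucket_size(tree, widths, x):
    """Number of sampled instances in the bucket that x falls into."""
    node = tree
    while "children" in node:
        a = node["attr"]
        key = math.floor((x[a] + node["offset"]) / widths[a])
        node = node["children"].get(key)
        if node is None:                        # x landed in an empty bucket
            return 0
    return node["size"]


def fit_forest(data, n_trees=100, sample_size=64, max_depth=2,
               branching=4, seed=0):
    """Build one hash-split tree per random subsample of the data."""
    rng = random.Random(seed)
    dims = len(data[0])
    forest = []
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        widths = []
        for j in range(dims):
            column = [p[j] for p in sample]
            spread = max(column) - min(column)
            # Bucket width per attribute; fall back to 1.0 for constant columns.
            widths.append(spread / branching if spread > 0 else 1.0)
        forest.append((build_tree(sample, widths, 0, max_depth, rng), widths))
    return forest


def anomaly_score(forest, x):
    """Average over trees; small buckets give scores close to 1."""
    return sum(1.0 / (1.0 + bucket_size(t, w, x)) for t, w in forest) / len(forest)


# Synthetic demo data (hypothetical): 300 Gaussian points plus one planted outlier.
rng = random.Random(42)
data = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(300)]
data.append([10.0] * 10)
forest = fit_forest(data)
scores = [anomaly_score(forest, x) for x in data]
top = max(range(len(data)), key=lambda i: scores[i])
print("highest-scoring index:", top, "score:", round(scores[top], 3))
```

Each root-to-leaf path uses at most `max_depth` randomly chosen attributes, which corresponds to the subspace described in the abstract. Averaging over trees built on different subsamples plays the role of the score combination step. A shallow depth is used here so that buckets in dense regions still hold several instances, which keeps bucket size informative.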
Keywords/Search Tags: Outlier detection, high-dimensional data, relevant subspace, noise detection, ensemble learning