
Study On Outlier Detection In Subspace

Posted on: 2011-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2178330338491056
Subject: Computer software and theory
Abstract/Summary:
Outlier detection has become an active research topic in data mining. As its range of applications keeps expanding, traditional outlier detection algorithms face a major obstacle: they cannot cope with the characteristics of high-dimensional data. Researchers have proposed several approaches to this problem, among which subspace mining is an effective one for high-dimensional data. Existing subspace outlier detection algorithms still have many problems, however. For instance, their accuracy is low, and their parameters are difficult to choose, which leads to unstable results. This thesis studies subspace outlier detection algorithms with these problems in mind.

Firstly, the algorithm for outlier detection in axis-parallel subspaces of high-dimensional data (SOD) is introduced, and an improved algorithm is proposed to address its deficiencies. On the one hand, by quantifying the aggregation of each dimension, the reference value of each dimension can be fixed, which reduces the impact of parameter settings on the results. On the other hand, using relative distance to express the degree of deviation makes it convenient to detect outliers in subspaces of different densities.

Secondly, because the number of clusters in a data set is unknown, a relevant subspace measure based on Gini-entropy is proposed, and the relevant subspace outlier degree is defined. On this basis, a new outlier detection algorithm based on relevant subspaces, RSOD, is proposed. This algorithm reduces the prior knowledge required about the data set and is not limited by the number of clusters: whether the data set contains one cluster or several, it can effectively select relevant subspaces and detect outliers.

Finally, four data sets, including both synthetic and real data, are used to validate the two algorithms proposed in this thesis.
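
To make the idea of axis-parallel subspace outlier detection more concrete, the following is a minimal sketch in the spirit of the baseline SOD algorithm, not the improved algorithm or RSOD proposed in the thesis. It rests on several assumptions that are not stated in the abstract: the reference set of a point is taken as its k nearest neighbours in the full space (a simplification), a dimension counts as relevant when the reference set's variance in it falls below `alpha` times the average variance over all dimensions, and the score is the distance to the reference mean in the relevant dimensions divided by their number. The function name `sod_scores`, the parameters `k` and `alpha`, and the toy data are all hypothetical.

```python
import math

def sod_scores(data, k=3, alpha=0.8):
    """Simplified SOD-style outlier scores; a higher score is more outlying."""
    d = len(data[0])
    scores = []
    for i, p in enumerate(data):
        # Assumption: reference set = k nearest neighbours in the full space.
        others = [q for j, q in enumerate(data) if j != i]
        ref = sorted(others, key=lambda q: math.dist(p, q))[:k]
        # Per-dimension mean and variance of the reference set.
        mean = [sum(q[a] for q in ref) / len(ref) for a in range(d)]
        var = [sum((q[a] - mean[a]) ** 2 for q in ref) / len(ref)
               for a in range(d)]
        avg_var = sum(var) / d
        # Relevant dimensions: those where the reference set is tightly packed.
        relevant = [a for a in range(d) if var[a] < alpha * avg_var]
        if not relevant:
            scores.append(0.0)
            continue
        # Deviation from the reference mean, measured only in the subspace.
        dist = math.sqrt(sum((p[a] - mean[a]) ** 2 for a in relevant))
        scores.append(dist / len(relevant))
    return scores

# Toy data: tight in dimension 0, spread in dimension 1; the last point
# deviates only in dimension 0.
data = [(1.0, 0.0), (1.1, 2.0), (0.9, 4.0), (1.0, 6.0),
        (1.1, 8.0), (0.9, 10.0), (5.0, 5.0)]
scores = sod_scores(data)
most_outlying = scores.index(max(scores))
```

In this toy example the last point looks ordinary in dimension 1 but stands out in dimension 0, where its neighbours agree closely. The dependence of the result on `k` and `alpha` illustrates the parameter sensitivity that the abstract cites as a motivation for the improved algorithm.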
Keywords/Search Tags:Data Mining, Outlier, High-dimensional data, Subspace, Entropy