Research On Clustering Algorithms For High-Dimensional Data

Posted on: 2012-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: M Chen
Full Text: PDF
GTID: 2218330338474173
Subject: Computer application technology
Abstract/Summary:
Cluster analysis is an important research area of data mining. The goal of clustering is to partition a data set, without any prior knowledge, into clusters such that data within a cluster are similar and data in different clusters are dissimilar; this makes it very different from data classification. Clustering has been widely used in information filtering, automatic classification of data, market analysis and other areas.

Clustering high-dimensional data is an important task within cluster analysis and has already attracted research attention. Traditional clustering algorithms often fail on high-dimensional data because of its "sparsity" and the "dimension effect". Yet real-world applications produce large amounts of high-dimensional data, such as retail transaction data, document data, spatial data, geographic data, multimedia data, network access data, time series data and genetic data. Research on clustering algorithms for high-dimensional data is therefore of great significance.

There are three main approaches to clustering high-dimensional data: first, methods based on dimension reduction; second, methods based on subspaces; third, other methods.

Based on a study of these three kinds of algorithms, we propose an algorithm named SDSCA, which is based on similar dimensions. First, the Gini value is used to remove redundant attributes from the data space. Next, the similar dimension is used to find the attributes that are close to each other. Finally, traditional clustering algorithms are applied to the subspaces formed by the similar dimensions. The experimental results show that SDSCA is effective and that it removes redundant attributes effectively.

We also propose PPSC*, an improved algorithm based on pattern similarity, which makes two improvements to the original algorithm. First, it uses the Gini value to remove redundant attributes from the data space. Second, it removes the transactional databases that carry little information and uses the remaining ones to construct the P-tree, in order to mine frequent subspaces and clusters. The experimental results show that the improved algorithm PPSC* is more effective than the original one.
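The abstract does not define the Gini value or state which attributes count as redundant, so the following Python sketch is only one plausible reading of the first step shared by SDSCA and PPSC*. It assumes that each attribute is discretized into equal-width bins, that the Gini value is the Gini impurity of the resulting bin distribution, and that attributes whose values are spread almost uniformly (Gini value close to its maximum) are the redundant ones. The function names, the bin count and the threshold are hypothetical and do not come from the thesis.

```python
from collections import Counter


def gini_value(column, bins=10):
    """Gini impurity of a numeric column after equal-width binning.

    Assumption: this stands in for the thesis's "Gini value",
    whose exact definition the abstract does not give.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return 0.0  # constant column: every value falls in a single bin
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum is clamped into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in column)
    n = len(column)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def drop_redundant(data, bins=10, threshold=0.85):
    """Return the indices of the attributes to keep.

    Assumption: an attribute is treated as redundant when its Gini value
    reaches the (hypothetical) threshold, i.e. its values are spread
    almost uniformly and show no cluster structure.
    """
    keep = []
    for j in range(len(data[0])):
        column = [row[j] for row in data]
        if gini_value(column, bins) < threshold:
            keep.append(j)
    return keep
```

For example, on a data set whose first attribute takes only two well-separated values and whose second attribute is spread evenly over its range, `drop_redundant` keeps only the first attribute. Under this reading a constant attribute has a Gini value of 0 and is kept, so a real implementation would probably filter such attributes separately.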
Keywords/Search Tags: high-dimensional data, clustering analysis, subspace clustering, pattern similarity