Research On Clustering Algorithms For High-Dimensional Data

Posted on:2012-09-09

Degree:Master

Type:Thesis

Country:China

Candidate:M Chen

GTID:2218330338474173

Subject:Computer application technology

Abstract/Summary:

Cluster analysis is an important research area in data mining. The goal of clustering is to partition a data set, without any prior knowledge, into clusters such that data within a cluster are similar and data in different clusters are dissimilar; this is what distinguishes it from data classification. Clustering has been widely used in information filtering, automatic classification of data, market analysis, and other areas. Clustering algorithms for high-dimensional data are an important topic within cluster analysis and have already attracted research attention. Traditional clustering algorithms tend to fail on high-dimensional data because of its sparsity and the "dimension effect". Yet high-dimensional data are common in practice: retail transaction data, document data, spatial data, geographic data, multimedia data, network access data, time series data, genetic data, and so on. Research on clustering algorithms for high-dimensional data is therefore of considerable importance.

There are three main approaches to clustering high-dimensional data: methods based on dimension reduction, methods based on subspaces, and other methods.

Based on a study of these three kinds of algorithms, we propose an algorithm named SDSCA, which is based on similar dimensions. First, the Gini value is used to remove redundant attributes from the data space. Next, the similar dimension is used to find attributes that are close to each other. Finally, traditional clustering algorithms are applied to the subspaces formed by the similar dimensions. The experimental results show that SDSCA is effective and that it also reduces the redundant attributes effectively.

We also propose PPSC*, an improved algorithm based on pattern similarity, which makes two improvements to the original algorithm. First, the Gini value is used to remove redundant attributes from the data space. Second, the transactional databases that contain little information are removed, and the remaining ones are used to construct the P-tree for mining frequent subspaces and clusters. The experimental results show that the improved algorithm PPSC* is more effective than the original one.
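
The abstract does not define the Gini value or say how it identifies a redundant attribute, so the following is only a minimal sketch under stated assumptions, not the thesis's implementation. It assumes each numeric attribute is discretized into equal-width bins, that the Gini value is the Gini impurity 1 - sum(p_i^2) of the resulting histogram, and that an attribute whose values are spread almost uniformly (a Gini value above a chosen threshold) is treated as redundant. The names `gini_value` and `keep_attributes`, the `bins` parameter, and the threshold are all hypothetical.

```python
from collections import Counter


def gini_value(column, bins=4):
    """Gini impurity 1 - sum(p_i^2) of a numeric column after equal-width binning.

    Assumption: this is one plausible reading of "Gini value"; the abstract
    does not give the formula.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return 0.0  # constant column: every value falls in a single bin
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum is clamped into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in column)
    n = len(column)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def keep_attributes(rows, threshold, bins=4):
    """Return the indices of attributes whose Gini value is at most `threshold`.

    Assumption: a near-uniform attribute (high Gini value) is the redundant
    one; the threshold is a free parameter chosen for illustration.
    """
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns)
            if gini_value(col, bins) <= threshold]


# Attribute 0 forms two tight groups; attribute 1 is spread evenly.
rows = [(0, 0), (0, 1), (0, 2), (0, 3), (10, 4), (10, 5), (10, 6), (10, 7)]
kept = keep_attributes(rows, threshold=0.6)
print(kept)  # [0]
```

Under these assumptions, the retained attribute indices would then be passed to the similar-dimension grouping and the traditional clustering step that the abstract describes.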

Keywords/Search Tags:

high-dimensional data, clustering analysis, subspace clustering, pattern similarity

Related items

1	Research On Clustering Algorithms For High-Dimensional Data
2	Research On Algorithms Of Subspace Clustering Based On Pattern Similarity
3	Study On High-dimensional Data Subspace Clustering Analysis And Application
4	Research On Subspace Clustering Algorithms For High-dimensional Data
5	Research On Subspace Clustering Algorithm For High Dimensional Data
6	Research On Subspace Clustering Algorithm On High-dimensional Categorical Datasets
7	Research On Key Technologies Of Clustering High-dimensional Data Based On Sparse Subspace And Their Applications
8	Research On Improved Subspace Clustering Algorithm
9	Research On Clustering Algorithms For High-Dimensional Data
10	Improvement Research Of Clustering Algorithm Based On High-dimensional Data