The Research On Subspace Clustering For High Dimensional Data

Posted on:2008-01-08

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Gan

GTID:2178360215951633

Subject:Computer software and theory

Abstract/Summary:

Clustering is an important task in data mining, and clustering large-scale, high-dimensional datasets is currently both difficult and actively studied. Because of data sparsity, the empty space phenomenon, and the curse of dimensionality, the objects in a dataset often become nearly equidistant from one another, which completely masks the clusters. As the number of dimensions increases, distance measures become increasingly meaningless, so traditional clustering methods based on distance similarity perform poorly. To address these problems, this thesis makes the following contributions.

First, because traditional clustering algorithms encounter many difficulties with high-dimensional data, we compare the advantages and shortcomings of different dimensionality reduction methods and conclude that subspace clustering methods are essential and useful.

Second, after further examining existing subspace clustering algorithms, we find that they are less efficient than expected. The reason is that traditional subspace clustering algorithms must scan the database many times to discover the subspaces of clusters. Moreover, these methods can handle only a single data type, either numerical or categorical. Noting the strong similarity between subspaces and the frequent patterns of association rule analysis, this thesis proposes a subspace clustering method based on a pattern tree (the PSC algorithm for short). PSC discovers the subspaces in a single scan of the database, which improves clustering efficiency, and it handles both numerical and categorical data. Experiments demonstrate that our method significantly improves on the accuracy and speed of previous methods.

Finally, most clustering models define similarity among objects by distances over dimensions. However, distance functions are not always adequate for capturing correlations among objects. Strong correlations may still exist among a set of objects even if they are far apart as measured by distance functions, and pattern-based clustering can discover this kind of cluster. However, state-of-the-art pattern-based clustering methods are inefficient and lack criteria for evaluating the quality of clusters. It is therefore important to enable pattern-based clustering methods (i) to handle large datasets and (ii) to evaluate the quality of clusters. In this thesis, we present a novel algorithm, PPSC, that offers this capability. The method uses new evaluation criteria to discover the best clusters, which makes the clustering result more meaningful. Meanwhile, by applying the pattern tree, PPSC can determine the subspaces by scanning the database once, so it performs efficiently on large datasets.
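
The two observations that motivate the abstract can be illustrated with a small sketch. It is not the PSC or PPSC algorithm, whose details the abstract does not give. The function names `relative_contrast` and `shift_spread`, the uniform random data, and the choice of point counts and dimensions are all assumptions made for illustration. The first function measures how much the nearest and farthest neighbours of a query point differ; the second checks whether two objects follow the same profile up to a constant shift, which is one simple form of pattern similarity.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over the distances from one random query
    point to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


def shift_spread(a, b):
    """Return the range of the per-dimension differences a[i] - b[i].
    A value of 0 means b is exactly a shifted by a constant."""
    diffs = [x - y for x, y in zip(a, b)]
    return max(diffs) - min(diffs)


# Hypothetical objects: b has the same profile as a, shifted by 100.
a = [1.0, 5.0, 2.0, 8.0]
b = [101.0, 105.0, 102.0, 108.0]

low_dim_contrast = relative_contrast(200, 2)
high_dim_contrast = relative_contrast(200, 500)
euclidean_gap = math.dist(a, b)   # large: the objects are far apart
pattern_gap = shift_spread(a, b)  # zero: the objects share one pattern
```

Under these assumptions the contrast computed in 500 dimensions is much smaller than in 2 dimensions, which is the near-equidistance effect described above. The objects `a` and `b` are far apart in Euclidean distance yet have a shift spread of zero, so a distance-based method would separate them while a pattern-based criterion would group them.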

Keywords/Search Tags:

Data mining, Clustering analysis, Subspace clustering, Pattern tree, Pattern similarity

Related items

1	Research On Algorithms Of Subspace Clustering Based On Pattern Similarity
2	A Study Of The Pattern-Based Clustering Theories
3	Research On Subspace Clustering Algorithm On High-dimensional Categorical Datasets
4	Research On Web Log And Subspace Clustering Mining Algorithms
5	Research On Clustering Algorithms For High-Dimensional Data
6	Pedestrian Behavior Pattern Recognition And Analysis Of Indoor Location Data
7	The Research On The Method Of QAR Data Organization Based On Data Warehouse And The Similarity Measurement Of Clustering Pattern
8	Research On Improved Subspace Clustering Algorithm
9	Research On Algorithm Of Web User Browsing Pattern Fuzzy Clustering
10	Study On Outlier Mining Algorithms Based On Clustering