
The Research On Subspace Clustering For High Dimensional Data

Posted on: 2008-01-08; Degree: Master; Type: Thesis
Country: China; Candidate: Y L Gan; Full Text: PDF
GTID: 2178360215951633; Subject: Computer software and theory
Abstract/Summary:
Clustering is an important task in data mining, and clustering large-scale, high-dimensional datasets is a difficult and actively studied problem. Because of data sparsity, the empty space phenomenon, and the curse of dimensionality, the objects in a dataset are often nearly equidistant from one another, which completely masks the clusters. As the number of dimensions increases, distance measures become increasingly meaningless, so traditional clustering methods based on distance similarity perform poorly. To address these problems, this thesis makes the following contributions.

First, because traditional clustering algorithms encounter many difficulties with high-dimensional data, we compare the advantages and shortcomings of different dimensionality reduction methods and conclude that subspace clustering methods are essential and useful.

Second, after further examining existing subspace clustering algorithms, we find that they are less efficient than expected. The reason is that traditional subspace clustering algorithms must scan the database many times to discover the subspaces of clusters. Moreover, these methods handle only a single data type, either numerical or categorical. Noting the strong similarity between subspaces and the frequent patterns of association rule analysis, this thesis proposes a subspace clustering method based on a pattern tree (the PSC algorithm for short). PSC discovers the subspaces by scanning the database only once, which improves clustering efficiency, and it handles both numerical and categorical data. Experiments demonstrate that our method significantly improves on the accuracy and speed of previous methods.

Third, most clustering models define similarity among objects by distances over dimensions. However, distance functions do not always capture correlations among objects. Strong correlations may exist among a set of objects even when they are far apart as measured by distance functions, and pattern-based clustering can discover clusters of this kind. State-of-the-art pattern-based clustering methods, however, are inefficient and lack criteria for evaluating the quality of clusters. It therefore becomes important to enable pattern-based clustering methods to (i) handle large datasets and (ii) evaluate the quality of clusters. In this thesis we present a novel algorithm, PPSC, that offers this capability. The method uses new evaluation criteria to discover the best clusters, which makes the clustering result more meaningful. By applying the pattern tree, PPSC determines the subspaces in a single scan of the database, so it performs efficiently on large datasets.
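The claim that objects can be strongly correlated while far apart in distance can be illustrated with a small sketch. The objects `o1`, `o2`, `o3` and the helper `pattern_spread` below are hypothetical and chosen only for illustration; the spread measure is a simplified stand-in for pattern similarity and is not the evaluation criterion used by PPSC, which the abstract does not specify.

```python
import math


def euclidean(a, b):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pattern_spread(a, b):
    """Range of the per-dimension differences between a and b.

    A value of 0 means b equals a shifted by a constant, i.e. the two
    objects follow exactly the same pattern across the dimensions.
    (Illustrative measure only; an assumption, not the thesis's criterion.)
    """
    diffs = [x - y for x, y in zip(a, b)]
    return max(diffs) - min(diffs)


# Hypothetical objects over four dimensions.
o1 = [1.0, 5.0, 2.0, 7.0]
o2 = [101.0, 105.0, 102.0, 107.0]  # o1 shifted by +100: same pattern, far away
o3 = [2.0, 4.0, 3.0, 6.0]          # near o1, but the pattern differs

print(euclidean(o1, o2), pattern_spread(o1, o2))  # 200.0 0.0
print(euclidean(o1, o3), pattern_spread(o1, o3))  # 2.0 2.0
```

Under a distance-based view `o3` is the closer neighbour of `o1`, whereas under this pattern-based view `o2` is the perfect match, which is the kind of cluster a distance function would miss.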
Keywords/Search Tags: Data mining, Clustering analysis, Subspace clustering, Pattern tree, Pattern similarity