
Clustering Algorithms Analysis On Data Dimension

Posted on: 2008-10-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Zhang
GTID: 1118360245492669
Subject: Computer application technology
Abstract:
This dissertation analyzes the characteristics and requirements of clustering algorithms across different data dimensionalities, and presents solutions and algorithms for low-dimensional clustering, high-dimensional clustering, and subspace clustering.

For low-dimensional clustering we design an algorithm called GDMS. A detecting window is introduced to discover the distribution of the data set, and a filter-function framework distinguishes clusters of different density or attributes. Different movement styles and their combinations are designed to speed up the algorithm while maintaining high accuracy. A new mathematical morphology operator is introduced to discover clusters; it is more accurate than the ordinary open and close operators. Scale-space theory is integrated with mathematical morphology to obtain multi-scale, hierarchical clusters, and GDMS is further extended to support clustering in the presence of obstacles. The advantages of GDMS are: it is very efficient, with O(N) complexity; it discovers clusters of arbitrary shape; it distinguishes clusters of different density or attributes; it is insensitive to noise and to the grid size; and its hierarchical result is easy to understand and interpret.

To speed up high-dimensional clustering we design three algorithms: MDCLUS, IMDCLUS, and PMDCLUS. A Monte Carlo method is adopted to acquire core objects with reduced computational cost, and the minimum sample ratio is estimated quantitatively to avoid losing or breaking clusters. A label-hash method makes cluster merging linear in the number of objects. We also realize incremental clustering and distributed parallel clustering.
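The grid-plus-morphology idea behind GDMS can be illustrated with a minimal sketch. The thesis does not publish GDMS's operators, so the fragment below uses only the classical building blocks it mentions: a binary grid of dense cells, a morphological closing (dilation then erosion) with a 3x3 structuring element to repair small holes, and 8-connected component labeling to extract clusters. All function names are illustrative, not the author's.

```python
from collections import deque

def dilate(cells):
    """Morphological dilation of a binary grid with a 3x3 structuring element."""
    out = set(cells)
    for (i, j) in cells:
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                out.add((i + di, j + dj))
    return out

def erode(cells):
    """Morphological erosion with the same 3x3 structuring element."""
    return {(i, j) for (i, j) in cells
            if all((i + di, j + dj) in cells
                   for di in (-1, 0, 1) for dj in (-1, 0, 1))}

def label_clusters(cells):
    """Label 8-connected components of the binary grid: one label per cluster."""
    labels, next_label = {}, 0
    for start in cells:
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            i, j = queue.popleft()
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nb = (i + di, j + dj)
                    if nb in cells and nb not in labels:
                        labels[nb] = next_label
                        queue.append(nb)
        next_label += 1
    return labels

# Two dense blobs; blob A has a one-cell hole that closing repairs.
dense = {(i, j) for i in range(3) for j in range(3)} - {(1, 1)}
dense |= {(i, j) for i in range(10, 13) for j in range(10, 13)}
closed = erode(dilate(dense))       # closing fills the hole at (1, 1)
clusters = label_clusters(closed)
print(len(set(clusters.values())))  # -> 2
```

Closing is exactly the kind of "ordinary operator" the abstract says GDMS improves upon: it removes small gaps inside a cluster without merging well-separated clusters, here yielding the two expected components.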
The advantages of this approach are: it discovers clusters of arbitrary shape; it is insensitive to noise; the clustering complexity is linear in the data dimensionality; and it is faster than DBSCAN.

For subspace clustering we design active-space and grid algorithms. We prove the downward closure properties of density, connection, and coverage, and introduce a top-down search method to reduce the search space. A new filter based on the number of active axes removes noise objects. The fixed-grid algorithm is extended to an adaptive grid for efficiency, and a distributed parallel version is realized for large, high-dimensional data sets. A hierarchical method organizes the clustering result. The advantages of this approach are: it discovers clusters in both the entire space and subspaces; its complexity is approximately linear in the number of objects, the space dimensionality, and the cluster dimensionality; it is insensitive to noise; its result is easy to understand and interpret; and it can find both disjoint and overlapping clusters.
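The role of downward closure in a top-down subspace search can be sketched as follows. This is an assumption-laden toy, not the thesis's algorithm: it grids each projection, calls a subspace dense if some cell holds enough points, and exploits the fact that every projection of a dense subspace is itself dense, so a dense subspace can be reported as maximal and its children skipped, while only sparse subspaces are split further. All names (`dense_units`, `maximal_dense_subspaces`) are hypothetical.

```python
from itertools import combinations

def dense_units(points, dims, cell=1.0, min_pts=3):
    """Project points onto the subspace `dims`, grid them, and return
    the cells holding at least `min_pts` points."""
    counts = {}
    for p in points:
        key = tuple(int(p[d] // cell) for d in dims)
        counts[key] = counts.get(key, 0) + 1
    return {c for c, n in counts.items() if n >= min_pts}

def maximal_dense_subspaces(points, n_dims, cell=1.0, min_pts=3):
    """Top-down search from the full space. By downward closure of
    density, every projection of a dense subspace is dense too, so a
    subspace that already holds dense units is reported as maximal and
    not expanded; only sparse subspaces are split into their
    (k-1)-dimensional children."""
    found, seen = [], set()
    frontier = [tuple(range(n_dims))]
    while frontier:
        next_frontier = []
        for dims in frontier:
            if dims in seen or not dims:
                continue
            seen.add(dims)
            if dense_units(points, dims, cell, min_pts):
                found.append(dims)  # maximal: stop descending here
            else:
                next_frontier.extend(combinations(dims, len(dims) - 1))
        frontier = next_frontier
    # Drop subspaces already covered by a reported superspace.
    return [s for s in found if not any(set(s) < set(t) for t in found)]

# Five points clustered in dimensions 0 and 1 but spread along dimension 2:
# no cluster exists in the full space, only in the subspace (0, 1).
pts = [(0.1, 0.2, float(i)) for i in range(5)]
print(maximal_dense_subspaces(pts, 3))  # -> [(0, 1)]
```

The same closure property is what makes the pruning sound: since the full space (0, 1, 2) is sparse, the search must descend, but once (0, 1) is found dense, none of its sub-subspaces need to be visited through it.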
Keywords/Search Tags: Incremental clustering, Parallel clustering, Subspace clustering, Mathematical morphology, Scale space