Research On Clustering Algorithms For Very Large And High Dimensional Data

Posted on:2008-03-23

Degree:Master

Type:Thesis

Country:China

Candidate:Y Q Wang

Full Text:PDF

GTID:2178360215471163

Subject:Computer software and theory

Abstract/Summary:


With the wide use of information technology, data generated by various information systems accumulate rapidly, and more efficient data mining tools are needed to find valuable knowledge patterns. Clustering analysis is an important method in data mining. It is a discovery process that groups a set of data such that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Clustering data in a high dimensional space is of great interest in many data mining applications. In high dimensional data sets, finding the latent, natural clusters is more difficult and remains a problem to be resolved.

In this dissertation we study the clustering of high dimensional data. First we introduce the characteristics of high dimensional data, namely sparsity and the curse of dimensionality, and their effects. Following the literature, we classify clustering algorithms for high dimensional data as follows: dimensionality reduction, subspace clustering, collaborative clustering, and other clustering algorithms. We analyze the subspace clustering algorithm ENCLUS and point out its defects:

1. It requires users to supply input parameters, such as the interval of a dimension, but users often lack the information needed to choose them.
2. Some points in sparse grid cells that neighbor dense grid cells are regarded as outliers.

We therefore present a novel algorithm, called OGBS. Its main innovations are as follows:

1. According to information theory, if the data follow a normal distribution, the number of equal-sized bins for an ideal frequency histogram should be (1+log₂N) to show the distribution properties of a high dimensional data set. We therefore partition each dimension into (1+log₂N) intervals.
2. Each sparse grid cell is bisected into smaller cells, the density of the smaller cells is calculated, and the cluster each cell should fall into is estimated.

Experimental results show that OGBS performs better than the existing projected clustering algorithm ENCLUS, and that the clusters obtained by OGBS have smoother boundaries than those obtained by ENCLUS.

OGBS is applied to remote sensing data clustering. We propose the idea of spatial continuity and apply it successfully to remote sensing data clustering to analyze outliers. Experimental results indicate that the method performs better than those provided in the ENVI software.
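
The two grid steps named in the abstract, partitioning each dimension into (1+log₂N) intervals and bisecting a sparse cell to recount density, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the choice to round log₂N up to an integer, and the use of the data's minimum and maximum as grid bounds are all assumptions made for illustration, and the rule for assigning bisected cells to clusters is not given in the abstract, so it is left out.

```python
import math
from collections import Counter

def num_intervals(n_points):
    """Intervals per dimension, 1 + log2(N). Rounding up is an assumption."""
    return 1 + math.ceil(math.log2(n_points))

def cell_index(point, lows, highs, k):
    """Map a point to the tuple of interval indices of its grid cell."""
    idx = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / k
        i = int((x - lo) / width) if width > 0 else 0
        idx.append(min(i, k - 1))  # the maximum value falls in the last interval
    return tuple(idx)

def grid_counts(points):
    """Partition every dimension into k equal intervals and count points per cell."""
    k = num_intervals(len(points))
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]    # assumed grid bounds
    highs = [max(p[d] for p in points) for d in dims]
    return k, Counter(cell_index(p, lows, highs, k) for p in points)

def bisect_counts(points_in_cell, cell_lows, cell_highs):
    """Bisect one cell along every dimension and count points per sub-cell.

    Each sub-cell is keyed by a tuple of 0/1 flags (lower or upper half).
    """
    mids = [(lo + hi) / 2 for lo, hi in zip(cell_lows, cell_highs)]
    return Counter(
        tuple(int(x >= m) for x, m in zip(p, mids)) for p in points_in_cell
    )

# Hypothetical 2-D sample: two dense corners and two isolated points.
sample = [(0, 0), (0.1, 0.1), (0.2, 0.1), (1, 1),
          (0.9, 0.95), (0.95, 0.9), (0.5, 0.5), (1.0, 0.0)]
k, counts = grid_counts(sample)
```

With eight sample points the sketch uses four intervals per dimension. A cell's count divided by its volume gives a density that can be compared against a threshold to separate dense cells from sparse ones. The threshold itself is a further parameter that the abstract does not specify.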

Keywords/Search Tags:

data mining, clustering within subspace, high dimensionality, grid


Related items

1	Application Of Grid And Density Based Clustering Algorithm In Data Mining
2	Research On Improved Subspace Clustering Algorithm
3	Research Of Subspace-clustering Algorithms Based On Density Over High-dimensional Data
4	Research On Subspace Clustering Algorithm For High Dimensional Data
5	Research On Dimensionality Reduction And Clustering Methods For High-dimensional Data Based On Metric Learning
6	The Research On Subspace Clustering For High Dimensional Data
7	Research On Clustering Algorithm Based On Irregular Grid And Subspace Of Descending Dimension
8	Study On Grid-Based Clustering Algorithms
9	Research On Improved Sparse Subspace Clustering Algorithm
10	Research On Clustering Algorithm For High Dimensional Data