With the widespread use of information technology, data generated by various information systems accumulate rapidly, and more efficient data mining tools are needed to find valuable knowledge patterns. Cluster analysis is an important method in data mining. It is a discovery process that groups a set of data such that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Clustering data in a high-dimensional space is of great interest in many data mining applications. For high-dimensional data sets, finding the latent, natural clusters is more difficult and remains a problem to be resolved.

This dissertation studies the clustering of high-dimensional data. First, we introduce the characteristics of high-dimensional data, namely sparsity and the curse of dimensionality, and their effects. Following the literature, we classify clustering algorithms for high-dimensional data into four groups: dimensionality reduction, subspace clustering, collaborative clustering, and other clustering algorithms. We analyze the subspace clustering algorithm ENCLUS and point out the following defects:

1. It requires users to supply input parameters, such as the interval used to partition a dimension, yet users often lack the information needed to choose these parameters.
2. Some points in sparse grid cells that neighbor dense grid cells are regarded as outliers.

We therefore present a novel algorithm called OGBS. Its main innovations are as follows:

1. According to information theory, if the data follow a normal distribution, the number of equal-sized bins for an ideal frequency histogram should be (1 + log2 N), which reveals the distribution properties of high-dimensional data sets. We therefore partition each dimension into (1 + log2 N) intervals.
2. Each sparse grid cell is bisected into smaller cells, the density of each smaller cell is calculated, and the cluster to which the cell should belong is estimated.

According to the experimental results, OGBS performs better than the existing projected clustering algorithm ENCLUS.
The clusters obtained by OGBS have smoother boundaries than those obtained by ENCLUS. OGBS is also applied to remote sensing data clustering. We propose the idea of spatial continuity and apply it successfully to remote sensing data clustering to analyze outliers. Experimental results indicate that the method performs better than the methods provided in the ENVI software.
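The partitioning step described in this abstract can be illustrated with a minimal sketch. It is not the OGBS implementation: it only shows how each dimension might be split into (1 + log2 N) equal-width intervals and how the number of points per grid cell could be counted. It assumes that N is the number of data points and that a fractional bin count is rounded up. The function names `sturges_bins`, `grid_cell`, and `cell_densities` are hypothetical and do not come from the dissertation.

```python
import math
from collections import Counter


def sturges_bins(n_points):
    """Bins per dimension, 1 + log2(N). Rounding up is an assumption."""
    if n_points < 1:
        raise ValueError("need at least one data point")
    return 1 + math.ceil(math.log2(n_points))


def grid_cell(point, mins, maxs, bins):
    """Map a point to the tuple of interval indices of its grid cell."""
    cell = []
    for x, lo, hi in zip(point, mins, maxs):
        width = (hi - lo) / bins
        if width == 0:
            # Degenerate dimension: every point shares one interval.
            idx = 0
        else:
            # Clamp so the maximum value falls into the last interval.
            idx = min(int((x - lo) / width), bins - 1)
        cell.append(idx)
    return tuple(cell)


def cell_densities(points):
    """Return the bin count per dimension and the point count per cell."""
    bins = sturges_bins(len(points))
    columns = list(zip(*points))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    counts = Counter(grid_cell(p, mins, maxs, bins) for p in points)
    return bins, counts


if __name__ == "__main__":
    sample = [(0, 0), (0.1, 0.1), (0.2, 0.1), (1, 1),
              (0.9, 0.95), (0.5, 0.5), (0.05, 0.02), (0.95, 0.9)]
    bins, counts = cell_densities(sample)
    print(bins)
    for cell, count in sorted(counts.items()):
        print(cell, count)
```

In this sketch the interval count is derived from the data size alone, so the user does not have to supply it, which is the first ENCLUS defect the abstract identifies. Deciding which cells count as dense, and bisecting the sparse cells next to them, would require a density threshold and a merging rule that the abstract does not specify, so those steps are left out.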