Font Size: a A A

Research On Some Algorithms For High-Dimensional Data Clustering

Posted on:2009-05-11Degree:MasterType:Thesis
Country:ChinaCandidate:J J LiuFull Text:PDF
GTID:2178360242993657Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data mining is a technology of mining useful information from database. This information which is mined from database is also called knowledge, so data mining is also called KDD(Knowledge Discovery in Databases). The mined knowledge can be used in decision support, data analysis and any other field. By the development of database, data mining has been more and more important.By the increase of the size of data in databases, traditional clustering methods can't take effectly. Actually, the operation of clustering is just the measurement of similarity between different data objects. These data objects which have the high similarity should be assigned to one cluster. Euclid distance function and any other distance based functions are usually used in clustering on low-dimensional data. But they can't work well on high-dimensional data, because they can't transfer between different objects. And it's difficult to define these functions on high-dimensional data, so new measurement of similarity must be considered. Otherwise, because of the high-dimension, traditional algorithms always have the high complexity and limitation in their applications.In allusion to the problem of dimensionality curse caused by high-dimensional data, we proposed a clustering algorithm based on single-dimension partition which is called HDCA_SDP in this paper. It takes use of the relationship of single-dimension and multi-dimension. Transpose of index on single dimension is taken to predigest the method of high-dimensional data process. This algorithm works on one dimensional layer each time, processes with each layer and finally all the clusters will be gained.We have analyzed and improved a few improved algorithms on traditional k-means algorithm. On the base of HDCA_SDP algorithm, we proposed DFBC algorithm which is suit for higher-dimensional data. DFBC algorithm first partition the whole dimensions to some combinations of low dimensions, then cluster on each combination. The process is similar to HDCA_SDP algorithm. Compared to HDCA_SDP, DFBC algorithm is more suitable on large and higher-dimensional data. The running time of DFBC algorithm increases linearity with the increases of combinations. It's actually a compromise of low-dimensional and high-dimensional clustering method on its idea.We researched some technologies of grid based algorithms and analyzed fixed-grid based and adapt-grid's shortages. In allusion to GCOD's shortage we proposed an improved algorithm. A measure of intersecting grid is adopted in GCOD algorithm which is a compromise method of fixed-grid and adapt-grid based algorithms. But it hasn't limited the size of intersecting grids, and then there will be some irrationality clusters finally. In allusion to this problem we proposed a method to restrict the size of intersecting grids and more rational density computing method.Some clustering algorithms on subspace are researched in this paper. In allusion to the shortage of traditional CLIQUE algorithm, HIGSC algorithm which is based on half intersecting grids is proposed. It firstly clusters on all of the single dimensions based on half intersecting grids. Then generates subspace by the similar rule of Apriori. In the course of generating subspace clustering are generated by similar HDCA_SDP algorithm. Compared to CLIQUE, the performance of HIGSC algorithm is improved, especially on the aspect of precision of clustering.
Keywords/Search Tags:Data Mining, clustering, high-dimensional clustering, subspace
PDF Full Text Request
Related items