Font Size: a A A

Parameter Grid Clustering Algorithm

Posted on:2010-01-30Degree:MasterType:Thesis
Country:ChinaCandidate:L ZhangFull Text:PDF
GTID:2208360302977295Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Data mining technique can be used to find out potential and useful knowledge from the vast amount of data, and it plays a new significant role to store data in the info-times. With the rapid development of the data mining technique, the technique of clustering analysis, as an important part of data mining, are widely applied to the fields such as pattern recognition, data analysis, image processing, market research and so on. Research on grid clustering algorithms and boundary points has become highly active topics in the data mining research.In this thesis, we present the theory of data mining, and deeply analyze the algorithms of grid clustering. Then we find that there are some problems. Firstly, clustering results are sensitive to the input parameters. The number of partition on each dimensional and the density threshold are difficult to determine. Secondly, most traditional grid-based algorithms can not discover clusters correctly in multi-density datasets. To solve the two problems above, we propose a nonparametric grid-based clustering algorithm (NaRic) based on the concepts of grid entropy and grid informativity degree. Aiming to the disadvantages that BORDER has low efficiency and detecting precision, we develop a boundary points detection algorithm based on statistics information (BPSF). For the two algorithms NaRic and BPSF, we elaborate the idea of every algorithm and expound the functions of algorithms.In this thesis, we have developed NaRic, BPSF, SNN (shared nearest neighbor clustering algorithm) and BORDER algorithm and implemented it using Visual C++ 6.0. We conducted a series of experiments, including the experiment of the correctness and efficiency of grid clustering, on synthetic datasets and the real dataset.The experimental results show that, without any input parameter, the NaRic algorithm can discover clusters of arbitrary shapes and sizes, and also can identity clusters of uniforms datasets and multi-density datasets with noises. Compared with traditional shared nearest neighbor clustering algorithm SNN, NaRic has higher accuracy and efficiency. And BPSF algorithm can find boundary points of clusters of arbitrary shapes, different sizes and remove noise effectively. The accuracy and time complexity are higher than traditional boundary points detecting algorithm BORDER.In a word, the nonparametric grid-based clustering algorithm (NaRic) can solve the problem that the grid clustering algorithm is sensitive to parameters, and can identity clusters of multi-density datasets. And BPSF algorithm can identify the boundary points of clustering; it has higher precision and efficiency than that of BORDER.
Keywords/Search Tags:Clustering algorithm, Grid, Nonparametric, Multi-density, Entropy, Variance, Boundary points
PDF Full Text Request
Related items