Research On Determining The Number Of Clusters Based On Information Entropy

Posted on: 2012-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: X W Zhao
GTID: 2218330368489610
Subject: Computer application technology
Abstract/Summary:
Clustering analysis, a form of unsupervised learning, is a fundamental means of data granulation and information compression. It is also an important tool in machine learning and data mining research. Many clustering algorithms have been developed in the data mining research community, with applications in bioinformatics, web data analysis, information retrieval, text mining, and scientific data exploration, to name only a few major areas.

However, most of these algorithms require a user-specified number of clusters, or parameters that implicitly control the number of clusters, in advance. In many situations the number of clusters in the given data is unknown and must be estimated from the data themselves. Identifying the number of clusters in a data set, a quantity often labeled k, is therefore a fundamental and important topic in clustering analysis.

This thesis focuses on how to determine the number of clusters in clustering analysis. Its main contributions are summarized as follows:

(1) Based on the ideas of partitional and hierarchical clustering, an algorithm is proposed to determine the best number of clusters for categorical data, and its time complexity is analyzed. Experimental results on real-world UCI data sets demonstrate that the proposed algorithm is effective.

(2) For mixed data, the thesis presents a theoretical framework based on information entropy that measures the relationship between clusters uniformly for numerical and categorical data. A new cluster validity index based on the category utility function is then given to evaluate clustering results on mixed data. Furthermore, using the proposed framework and the modified k-prototypes algorithm, a new method for determining the number of clusters in mixed data sets is presented. Experimental results on several synthetic and real data sets show that the proposed method is effective.

(3) Based on the B/S (browser/server) architecture, a data mining system for clustering analysis is designed and implemented. Its basic functions include data input, data preprocessing, determining the number of clusters, choosing the initial centers, running the clustering algorithm, visualizing clustering results, and system management. By using component and Ajax technology, this experimental system provides a friendly graphical interface and an open programming interface, ensuring good generality and extensibility.

These contributions provide a reference for choosing the number of clusters when clustering categorical or mixed data sets, and further enrich research on cluster analysis in data mining.
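The abstract does not give the formulas behind its entropy-based framework, so the following Python sketch is only an illustration of the general idea and not the thesis's method. It assumes a common textbook formulation: the entropy of a cluster of categorical records is the sum of the Shannon entropies of its attributes, and a partition is scored by the size-weighted average of its cluster entropies. The function names (`attribute_entropy`, `cluster_entropy`, `expected_entropy`) are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def cluster_entropy(rows):
    """Sum of per-attribute entropies over the records of one cluster.

    Assumption: attributes are treated as independent, a common
    simplification that the abstract itself does not state.
    """
    return sum(attribute_entropy(column) for column in zip(*rows))


def expected_entropy(rows, labels):
    """Size-weighted average of cluster entropies for a given partition."""
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n = len(rows)
    return sum(len(members) / n * cluster_entropy(members)
               for members in clusters.values())


# Toy categorical data: two obvious groups.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
pure = expected_entropy(rows, [0, 0, 1, 1])    # clusters match the groups
mixed = expected_entropy(rows, [0, 1, 0, 1])   # clusters cut across groups
single = expected_entropy(rows, [0, 0, 0, 0])  # everything in one cluster
```

Under this formulation a lower score indicates purer clusters. Because the best achievable score can only fall as more clusters are allowed, a score like this would typically be combined with a penalty term or a validity index, rather than minimized directly, when estimating k.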
Keywords/Search Tags: Clustering analysis, Number of clusters, Information entropy, Categorical data, Mixed data