Research On Determining The Number Of Clusters Based On Information Entropy

Posted on:2012-04-04

Degree:Master

Type:Thesis

Country:China

Candidate:X W Zhao

Full Text:PDF

GTID:2218330368489610

Subject:Computer application technology

Abstract/Summary:

Cluster analysis, a form of unsupervised learning, is a fundamental means of data granulation and information compression, and an important tool in machine learning and data mining research. The data mining community has developed many clustering algorithms, with applications in bioinformatics, web data analysis, information retrieval, text mining, and scientific data exploration, to name only a few major areas.

However, most of these algorithms require the user to specify in advance either the number of clusters or parameters that implicitly control it. In many situations the number of clusters in the data is unknown and must be estimated from the data themselves. Identifying the number of clusters in a data set, a quantity often labeled k, is therefore a fundamental and important topic in cluster analysis.

This thesis focuses on the problem of determining the number of clusters. Its main contributions are as follows:

(1) Based on the ideas of partitional and hierarchical clustering, an algorithm is proposed to determine the best number of clusters for categorical data, and its time complexity is analyzed. Experimental results on real-world UCI data sets demonstrate that the proposed algorithm is effective.

(2) For mixed data, a theoretical framework based on information entropy is presented that measures the relationship between clusters uniformly for numerical and categorical data. A new cluster validity index based on the category utility function is then given to evaluate clustering results on mixed data. Using the proposed framework and a modified k-prototypes algorithm, a new method for determining the number of clusters in mixed data sets is presented. Experimental results on several synthetic and real data sets show that the proposed method is effective.

(3) Based on the B/S (browser/server) architecture, a data mining system for cluster analysis is designed and implemented. Its basic functions include data input, data preprocessing, determining the number of clusters, choosing the initial cluster centers, running the clustering algorithm, visualizing clustering results, and system management. By using component and Ajax technology, this experimental system provides a friendly graphical interface and an open programming interface, ensuring good generality and extensibility.

These contributions provide a reference for choosing the number of clusters when clustering categorical or mixed data sets, and further enrich research on cluster analysis in data mining.
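
To make the quantities named in the abstract more concrete, the following is a minimal Python sketch. It is not the thesis's algorithm or its validity index, which the abstract does not spell out. It assumes the standard Shannon entropy and the standard category utility function for purely categorical data, and it scores a few hand-written candidate partitions rather than running a clustering algorithm. The function names, the toy rows, and the candidate labelings are all hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def _squared_prob_sum(rows):
    """Sum over attributes and values of P(attribute = value)^2 within rows."""
    n = len(rows)
    total = 0.0
    for j in range(len(rows[0])):
        counts = Counter(row[j] for row in rows)
        total += sum((c / n) ** 2 for c in counts.values())
    return total


def category_utility(rows, labels):
    """Standard category utility of a partition (assumed form, not the thesis's index)."""
    n = len(rows)
    baseline = _squared_prob_sum(rows)
    # Group rows by their cluster label.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    gain = sum(
        len(members) / n * (_squared_prob_sum(members) - baseline)
        for members in clusters.values()
    )
    # Dividing by the number of clusters penalizes over-partitioning.
    return gain / len(clusters)


# Toy categorical data with two obvious groups (hypothetical).
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]

# Hand-written candidate partitions for k = 1, 2, 4 (hypothetical).
candidates = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    4: [0, 1, 2, 3],
}

scores = {k: category_utility(rows, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the two-cluster partition scores highest, because splitting further gains no additional purity while the division by the number of clusters lowers the score. In practice the candidate partitions would come from a clustering algorithm run once per value of k.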

Keywords/Search Tags:

Clustering analysis, Number of clusters, Information entropy, Categorical data, Mixed data

Related items

1	The Research Of Ant-Based Clustering Algorithm For Data Sets With Mixed Attribute
2	Studies On Clustering Algorithms For Categorical Data
3	Research On Interpretable Clustering Algorithms For Categorical Data
4	Algorithms Implementation Of Determining The Number Of Clusters And Initial Cluster Centers For Mixed Data
5	Research On Subspace Clustering Algorithm For Categorical Data
6	Study Of Algorithms For Clustering Categorical Data
7	Research On Density-Based Clustering Algorithm For Numerical Big Data
8	A Study Of The Clustering Algorithm For Mixed Data
9	The Research On Clustering Algorithm For Categorical Data Based-on Rough Set
10	Research On Partitioning Clustering Algorithms For Data With Mixed Numerical And Categorical Attributes