
Study Of Algorithms For Clustering Categorical Data

Posted on: 2009-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: M Wang
Full Text: PDF
GTID: 2178360242997738
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the development of databases and the Internet, the volume of data collected and stored has grown explosively in recent decades. How to analyze and explore these data, and how to discover useful knowledge from them rapidly and efficiently, has become a focus of research. To meet this challenge, clustering analysis has become an active area of data mining.

This thesis introduces clustering analysis in detail, covering the clustering methods used in data mining, their characteristics, and the methods for evaluating clustering results. Because datasets vary, clustering analysis must be able to handle a diversity of data types. The thesis puts its emphasis on algorithms for clustering categorical data (CCA).

Within the research on categorical data, the focus is on the partition approach. First, the partition-based k-modes clustering algorithm and its variations are introduced, together with their advantages and disadvantages. On the basis of partition similarity, a new definition of the accuracy of the k-modes algorithm is presented and applied to setting up cooperative learning groups. Then the fuzzy k-modes clustering algorithm is introduced, and attribute-weighted versions are presented to account for the different contribution of each attribute of the data set to the clustering. Next, the proximate k-median clustering algorithm for categorical data is discussed. With a new fitness function, an evolutionary strategy is used to optimize the weight matrix, and the clustering accuracy based on partition similarity is used to evaluate the clustering result. In the experiment, with the soybean disease data set as the input samples, this gives a better result.

Second, the elementary properties of entropy are characterized, and three entropy-based algorithms are briefly described. Next, a gravity model is introduced into a new clustering algorithm. The algorithm uses the incremental entropy as the radius and the cluster as the quantity.
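
The entropy quantities mentioned above can be illustrated with a minimal sketch. It assumes that the attributes are treated as independent, so that the entropy of a cluster is the sum of its per-attribute Shannon entropies, and it takes "incremental entropy" to mean the increase in size-weighted entropy when one record joins a cluster. Both are assumptions made for illustration; the abstract does not give the thesis's exact definitions, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_entropy(rows):
    """Sum of per-attribute Shannon entropies (in bits) of a list of categorical rows."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate over attributes
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def expected_entropy(clusters):
    """Size-weighted average of the cluster entropies of a partition."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


def incremental_entropy(cluster, row):
    """Assumed definition: growth in size-weighted entropy when `row` joins `cluster`."""
    grown = cluster + [row]
    return len(grown) * cluster_entropy(grown) - len(cluster) * cluster_entropy(cluster)


cluster = [("red", "round"), ("red", "round")]
print(incremental_entropy(cluster, ("red", "round")))   # 0.0: an identical record adds no disorder
print(incremental_entropy(cluster, ("green", "long")))  # positive: a dissimilar record adds disorder
```

Under this reading, a record that resembles a cluster has a small incremental entropy with respect to it, which is what makes the quantity usable as a distance-like term.
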
Three criteria, namely category utility, expected entropy and purity, were each used to measure the clustering results. For comparison, k-modes and COOLCAT were also used. Each algorithm was run 10 times on the UCI datasets, and the averages serve as the final results for the comparison.

Next, a new non-overlapping subspace clustering algorithm is proposed. As a rule, it is possible that no cluster exists in the multidimensional space. The sum of a compactness function and a separation function serves as the objective function. Applied to the UCI datasets, the algorithm yielded different clusters in their respective subspaces.

In short, the study of entropy-based clustering and subspace clustering of categorical data will be developed further.
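
As a point of reference for the k-modes baseline and the purity criterion named in this abstract, the following is a minimal sketch of standard k-modes (simple matching dissimilarity, per-attribute modes) together with the usual purity measure. It is not the thesis's implementation: the initial modes are passed in by hand, the toy records and class labels are invented for illustration, and all function names are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(rows):
    """Most frequent category of each attribute over a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, initial_modes, max_iter=100):
    """Alternate nearest-mode assignment and mode update until the labels stop changing."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_distance(r, modes[j]))
            for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = mode_of(members)
    return labels, modes


def purity(labels, classes):
    """Fraction of rows that belong to the majority class of their cluster."""
    counts = {}
    for lab, cls in zip(labels, classes):
        counts.setdefault(lab, Counter())[cls] += 1
    return sum(max(c.values()) for c in counts.values()) / len(labels)


# Toy data (illustrative only): two groups of three records each.
rows = [("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
        ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r")]
classes = [0, 0, 0, 1, 1, 1]
labels, modes = k_modes(rows, initial_modes=[rows[0], rows[3]])
print(labels, modes, purity(labels, classes))
```

Because k-modes depends on its initial modes, results are normally averaged over several runs, which matches the 10-run protocol described above.
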
Keywords/Search Tags: clustering analysis, categorical data, entropy, subspace