Font Size: a A A

Studies On Clustering Algorithms For Categorical Data

Posted on:2011-04-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:F Y CaoFull Text:PDF
GTID:1118360305495316Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
As an unsupervised learning method, cluster analysis is an impor-tant research direction in the field of machine learning. At the same time, clustering technology is also an important tool and method in the data mining. For the numerical data, methods and technology of cluster anal-ysis has been well-explored, however, in the real world, the categorical data is still widespread. As the categorical data lacks of the inherent geometric properties, distance functions between data points cannot be defined naturally, the corresponding clustering model and its algorithm design differentiate from that of the numerical data. In recent years, the problem of clustering categorical data has attracted more attention.In this thesis, some problems of clustering categorical data are in-vestigated, which includes choosing the initial cluster centers, similarity measure, and clustering algorithm for high dimension, large, time-evolving categorical data. The main contributions of this thesis are summarized as follows:(1) In clustering algorithms, choosing a subset of representative ex-amples is very important in data set. In this chapter, based on the fre-quency of attribute values, the average density of an object is defined. Furthermore, a novel initialization method for categorical data is pro-posed, in which the distance between objects and the density of the ob-ject is considered. We also apply the proposed initialization method toκ-modes algorithm and Fuzzyκ-modes algorithm. Experimental results illustrate that the proposed initialization method is superior to random initialization method and can be applied to large data sets for its linear time complexity with respect to the number of data objects.(2) In this chapter, the limitations of the simple matching dissim-ilarity measure and Ng's dissimilarity measure are analyzed using some illustrative examples. Based on the idea of biological and genetic taxon-omy, a new dissimilarity measure for theκ-modes algorithm is defined. A distinct characteristic of the new dissimilarity measure is to take ac- count of the distribution of attribute values on the whole universe. A convergence study and time complexity of theκ-modes algorithm based on new dissimilarity measure indicates that it can be effectively used for large data sets. The experimental results show the effectiveness of the new dissimilarity measure, especially on data sets with biological and genetic taxonomy information.(3) As the size of data growing at a rapid pace, clustering a very large data set inevitably incurs a time-consuming process. In this chap-ter, based on the frequency of attribute values in a given cluster, a novel similarity measure is proposed to allocate each unlabeled object into the corresponding appropriate cluster for clustering categorical data. Fur-thermore, a labeling algorithm for categorical data is presented, and its corresponding time complexity is analyzed as well. The effectiveness of the proposed algorithm is shown by the experiments on real world data sets.(4) Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a data set. In this chapter, a weightingκ-modes algorithm is presented for subspace cluster-ing of categorical data and its corresponding time complexity is analyzed as well. In the proposed algorithm, an additional step is added to theκ-modes clustering process to automatically compute the weight of all dimensions in each clusters by using information entropy. Furthermore, the attribute weighting can be used to identify the subsets of important dimensions that categorize different clusters. The effectiveness of the pro-posed algorithm is demonstrated with real data sets and synthetic data sets.(5) A fundamental assumption often made in unsupervised learn-ing is that the problem is static, i.e., the description of the classes does not change with time. However, many practical clustering tasks involve changing environments. In this chapter, we propose a generalized clus-tering framework for categorical time-evolving data, which is composed of three algorithms:Drifting-Concept Detecting Algorithm detecting the difference between the current sliding window and the last sliding window, Data Labeling Algorithm deciding the most appropriate cluster label for each object of the current sliding window based on the clustering results of the last sliding window, and Cluster Relationship Analysis Algorithm analyzing the relationship between clustering results at different time. Furthermore, the time complexity analysis of the proposed algorithms in-dicates that they can be effectively used in large data sets. The proposed framework is demonstrated on real data set, and is shown to not only accurately detect the drifting-concepts but also attain clustering results of better quality.(6) Based on B/S architecture, an intelligent data mining system designed and implemented, which include data input, data processing, statistical analysis, data mining, graph visualization, system maintenance. Use of component-based development and Ajax technology ensure that the system has good scalability and interactivityThe above mentioned contributions has further enriched the cluster analysis of categorical data, and provide technology support for the studies of biological information data, Web data, customer transaction data.
Keywords/Search Tags:clustering analysis, categorical data, initial cluster centers, max-min distance algorithm, object density, similarity measure, biological taxonomy, convergence, sampling techniques, rough set theory, data labeling, information entropy
PDF Full Text Request
Related items