
Studies On Hierarchical Clustering For Categorical Data

Posted on: 2015-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Zhao
Full Text: PDF
GTID: 2348330509959016
Subject: Computer application technology
Abstract/Summary:
Clustering is one of the important techniques for extracting or mining knowledge from large amounts of data. The data objects to be clustered are usually represented as categorical or numerical data. Most existing clustering algorithms focus on numerical data, and comparatively little research addresses categorical data, even though many data objects in practical applications are categorical. Moreover, hierarchical clustering has more potential for development than partitional clustering, for two reasons: (1) hierarchical clustering outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by partitional clustering; (2) hierarchical clustering does not require users to specify parameters such as the number of clusters, which partitional clustering needs. This thesis is therefore devoted to developing a new, efficient hierarchical clustering algorithm for categorical data.

Firstly, the shortcomings of existing clustering algorithms for categorical data are analyzed, and hierarchical clustering is compared with partitional clustering to identify the advantages and disadvantages of hierarchical clustering algorithms.

Secondly, drawing on the way evolutionary trees are built in biology, in particular the key results of the maximum likelihood method for constructing evolutionary trees, we propose a new hierarchical clustering algorithm for categorical data based on maximum likelihood, named HAC_ML. HAC_ML clusters categorical data directly on the raw data and overcomes a known shortcoming of hierarchical clustering, namely that earlier merge decisions cannot be revisited. Tests on datasets show that HAC_ML is stable and efficient in dealing with categorical data.

Thirdly, HAC_ML is further improved. RF distance information is used to constrain the search process and reduce the number of iterations of HAC_ML, and thus its running time. Experiments comparing the improved algorithm with the original HAC_ML show that the improved algorithm finds the optimal clustering result faster without changing the accuracy of the clustering results.

Finally, the thesis is summarized and directions for future research are described.
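
The abstract does not define the RF distance. Assuming it refers to the Robinson-Foulds distance commonly used to compare evolutionary trees, the following is a minimal sketch of how such a distance between two cluster hierarchies could be computed. The nested-tuple tree encoding and the function names `clusters` and `rf_distance` are illustrative assumptions, not part of HAC_ML, and the sketch uses the rooted (cluster-based) variant of the distance.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a leaf is a string label, an internal node is a
# tuple of subtrees, e.g. (("a", "b"), ("c", "d")).
Tree = Union[str, Tuple["Tree", ...]]


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the leaf set under every internal node of the hierarchy."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def rf_distance(a: Tree, b: Tree) -> int:
    """Rooted Robinson-Foulds distance: clusters present in exactly one tree."""
    return len(clusters(a) ^ clusters(b))


if __name__ == "__main__":
    balanced = (("a", "b"), ("c", "d"))
    ladder = ((("a", "b"), "c"), "d")
    print(rf_distance(balanced, balanced))
    print(rf_distance(balanced, ladder))
```

A distance of zero means the two hierarchies contain exactly the same clusters. One plausible use of such a measure in a tree search is to skip candidate hierarchies that differ too much from the current one, but how the thesis actually applies the constraint is not stated in the abstract.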
Keywords/Search Tags: Categorical data, Hierarchical clustering, Evolutionary tree, Maximum likelihood