Studies On Hierarchical Clustering For Categorical Data

Posted on:2015-08-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y C Zhao

GTID:2348330509959016

Subject:Computer application technology

Abstract/Summary:

Clustering is one of the important techniques for extracting or mining knowledge from large amounts of data. The data objects to be clustered are usually represented as categorical or numerical data. Most existing clustering algorithms focus on numerical data, and research on categorical data is comparatively scarce, even though many data objects in practical applications are categorical. Moreover, hierarchical clustering has more potential for development than partitional clustering, for two reasons: (1) hierarchical clustering outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by partitional clustering; (2) hierarchical clustering does not require users to specify parameters such as the number of clusters, which partitional clustering needs. This thesis is therefore devoted to developing a new, efficient hierarchical clustering algorithm for categorical data.

Firstly, the shortcomings of existing clustering methods for categorical data are analyzed, and hierarchical clustering is compared with partitional clustering to identify the advantages and disadvantages of hierarchical clustering algorithms.

Secondly, drawing on the way evolutionary trees are constructed in biology, and in particular on key results of the maximum likelihood method, we propose a new hierarchical clustering algorithm for categorical data based on maximum likelihood, named HAC_ML. HAC_ML clusters categorical data directly from the raw data and overcomes a well-known shortcoming of hierarchical clustering, namely that earlier steps cannot be undone. Tests on datasets show that HAC_ML is stable and efficient in dealing with categorical data.

Thirdly, HAC_ML is further improved. RF distance information is used to constrain the search process, which reduces the number of iterations of HAC_ML and therefore its running time. Experiments comparing the improved algorithm, which incorporates RF distance, with the original HAC_ML show that the improved algorithm finds the optimal clustering result faster without changing the accuracy of the clustering results.

Finally, the thesis is summarized and directions for future research are described.
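The abstract does not say how the RF distance is computed. Assuming that RF refers to the Robinson-Foulds distance commonly used to compare evolutionary trees, the following minimal sketch shows the rooted variant: each hierarchy is reduced to the set of leaf groups found under its internal nodes, and the distance is the number of groups that appear in only one of the two hierarchies. The nested-tuple tree encoding and the names `clusters` and `rf_distance` are hypothetical and chosen for illustration; this is not necessarily the form used in the improved HAC_ML.

```python
from typing import FrozenSet, Set, Tuple, Union

# Assumed encoding: a leaf is a string label, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the set of leaf labels found under every internal node."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        # Merge the leaf sets of all children of this internal node.
        merged = frozenset().union(*(leaves(child) for child in node))
        found.add(merged)
        return merged

    leaves(tree)
    return found


def rf_distance(a: Tree, b: Tree) -> int:
    """Rooted Robinson-Foulds distance: size of the symmetric difference
    between the two cluster sets. Leaves are never counted, and the root
    cluster cancels out when both trees share the same leaf labels."""
    return len(clusters(a) ^ clusters(b))


if __name__ == "__main__":
    t1 = ((("a", "b"), "c"), "d")
    t2 = ((("a", "c"), "b"), "d")
    print(rf_distance(t1, t2))
```

A distance of zero means the two hierarchies group the objects identically, so a search procedure could in principle use a measure like this to recognize candidate hierarchies that differ little from ones already examined.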

Keywords/Search Tags:

Categorical data, Hierarchical clustering, Evolutionary tree, Maximum likelihood

Related items

1	Outlier Detection For Categorical Data Based On Attribute Grouping Weight And Maximum Likelihood
2	The Study Of Clustering Data With Categorical Attributes In Data Mining
3	Study Of Methods Of Constructing Evolutionary Trees With DNA Sequences
4	The Research Of Application And Optimization Of Gaussian Mixture Model In Data Clustering
5	Similarity Measures And New Clustering Methods For Categorical Sequences
6	A Study On Clustering Algorithms For Categorical Data With Applications
7	Automatic categorical data clustering and spatial data clustering by consecutive resolution refinement
8	The Research On Clustering Algorithm For Categorical Data Using Quantum Mechanics
9	An Evolutionary Model For Maximum Likelihood Alignment Of DNA Sequences
10	Studies On Clustering Algorithms For Categorical Data