
ESCHCD: Entropy-based Algorithm For Subspace Clustering With High Dimensional Categorical Datasets

Posted on: 2012-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Du
Full Text: PDF
GTID: 2218330338453279
Subject: Computer software and theory
Abstract/Summary:
With the development of large-scale data storage, information, and network technologies, more and more people face the dilemma of data overload combined with a lack of knowledge. To meet the growing demand for information, data mining is widely used in many fields. Cluster analysis, which classifies data automatically, has become one of the most important data mining tools. Existing clustering algorithms perform well on low-dimensional datasets, and cluster analysis of high-dimensional numerical data has also made some progress. However, high-dimensional categorical data has not received enough attention, and because of the special nature of categorical data, existing clustering algorithms do not meet the requirements for handling it. Categorical data are typically sparse in a very high-dimensional space and have limited measurement levels; both properties make conventional dissimilarity measures inadequate. To solve these problems, this paper proposes a new clustering algorithm for high-dimensional categorical data, called ESCHCD (Entropy-based Algorithm for Subspace Clustering with High Dimensional Categorical Datasets). We design an effective, unsupervised objective function that determines the subspace associated with each cluster by considering the entropies of the matched subspace and the noise subspace. We also propose a global optimization method based on average entropy to find the best clustering result. We demonstrate the efficiency and scalability of ESCHCD through experiments on real and synthetic categorical datasets, comparing it with other categorical clustering algorithms.
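The abstract does not give the ESCHCD objective function itself, so the following is only a minimal sketch of the underlying idea: within one cluster, attributes whose values are nearly constant have low Shannon entropy and are candidates for the matched subspace, while attributes with scattered values have high entropy and behave like noise. The function names, the fixed `threshold`, and the toy data are assumptions made for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def split_subspace(cluster, threshold):
    """Split attribute indices of one cluster by entropy.

    Assumption for illustration: a fixed threshold separates the
    low-entropy (matched) attributes from the high-entropy (noise) ones.
    The thesis instead optimizes an objective over both subspaces.
    """
    n_attrs = len(cluster[0])
    entropies = [
        attribute_entropy([row[j] for row in cluster]) for j in range(n_attrs)
    ]
    matched = [j for j, h in enumerate(entropies) if h <= threshold]
    noise = [j for j, h in enumerate(entropies) if h > threshold]
    return entropies, matched, noise


# Toy cluster: attribute 0 is constant, attribute 1 is nearly constant,
# attribute 2 takes a different value in every record.
cluster = [
    ("red", "round", "a"),
    ("red", "round", "b"),
    ("red", "round", "c"),
    ("red", "square", "d"),
]

entropies, matched, noise = split_subspace(cluster, threshold=1.0)
print("entropies:", [round(h, 3) for h in entropies])
print("matched subspace:", matched)
print("noise subspace:", noise)
```

In this toy cluster the first two attributes fall into the matched subspace and the third into the noise subspace. A per-cluster subspace of this kind is what lets a subspace clustering method ignore attributes that are irrelevant to a given cluster.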
Keywords/Search Tags: categorical dataset, entropy, high-dimensional data, subspace clustering