Research On Subspace Clustering Algorithm For Categorical Data

Posted on: 2017-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: H R Zhang
GTID: 2308330503483628
Subject: Software engineering
Abstract/Summary:
With the maturation and wide adoption of science and technology in industry and academia, the scale of data has grown dozens of times over in the past decades. How to cope with large-scale data has therefore become one of the most prominent issues in academia. Cluster analysis, an unsupervised learning method, plays an important role in data mining. In practice, simple continuous numerical data cannot represent all content well, and categorical data and mixed data have become increasingly common. Computing the similarity between objects is an unavoidable and important step in cluster analysis. Because numerical data have natural mathematical advantages, their similarity is easy to compute, and the corresponding clustering algorithms have produced fruitful results. Categorical data, however, lack the geometric properties of numerical data, so their similarity cannot be obtained by conventional methods. Today, similarity for categorical data is commonly obtained either from a dissimilarity measure or from information entropy. K-modes, K-modes-GCC, Fuzzy K-modes, Genetic fuzzy K-modes, and NFKM represent the first group, while COOLCAT and ROCAT both use information entropy. Some of these algorithms operate on the full attribute space, some are sensitive to input parameters, and others are sensitive to the order of the input data. ROCAT, an effective clustering method for categorical data, has overcome some of these problems to an extent. It builds a data compression model that performs well on both synthetic and real-world data, but at a high computational complexity.
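The two families of similarity measures mentioned above can be illustrated with a minimal Python sketch. It is not taken from the thesis: `matching_dissimilarity` is the simple matching measure commonly associated with K-modes, and `cluster_entropy` sums per-attribute Shannon entropies under the assumption that attributes are treated independently. Both function names are hypothetical.

```python
import math
from collections import Counter


def matching_dissimilarity(x, y):
    """Count the attributes on which two categorical objects differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y))


def cluster_entropy(cluster):
    """Sum of per-attribute Shannon entropies (in bits) over a cluster.

    Assumption: attributes are treated as independent, so the cluster
    entropy is the sum of the entropies of the individual attributes.
    """
    n = len(cluster)
    total = 0.0
    for column in zip(*cluster):  # iterate attribute by attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total
```

A cluster whose objects agree on every attribute has entropy 0, so lower entropy corresponds to a purer cluster.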
ROCAT's accuracy also leaves room for improvement, although it outperforms existing methods. Based on these observations, we carried out the following work:

(1) We propose MICAT, a subspace clustering method for categorical data based on mutual information. The original clustering algorithm is divided into two stages, clustering and classification. The source data are decomposed into in-sample and out-of-sample data by random sampling; the in-sample data are clustered with ROCAT to obtain sample clusters; and the remaining objects are then assigned to those sample clusters to complete the clustering. ROCAT must load all of the data at once and then run several iterations over it. In MICAT, by contrast, loading and iteration apply only to the in-sample data, and similarity is computed one object at a time, so MICAT effectively reduces both time and space complexity.

(2) We develop SCDCM, a novel subspace clustering method based on a data cohesion model. Inspired by the law of universal gravitation, we suppose that a cohesion force exists between the members of a cluster and, as in traditional entropy-based algorithms, that the more similar the objects in a cluster are, the greater their cohesion. Candidate clusters are generated under the principle of minimum information entropy, and the best pure clusters are selected from them by the rule of maximum cohesion, so SCDCM effectively improves accuracy.

Experimental results show that MICAT greatly reduces complexity without reducing accuracy. We also compared SCDCM with ROCAT, NFKM, SUBCAD, CLICKS, CLIQUE, DHCC, and AT-DC on synthetic and real-world data; the results indicate that SCDCM improves accuracy while retaining the original advantages of ROCAT.
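The two-stage sample-then-assign idea described for MICAT can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's implementation: the in-sample cluster labels are taken as given (standing in for the output of ROCAT), and out-of-sample objects are assigned by simple matching distance to each cluster's mode rather than by mutual information. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def split_sample(data, sample_size, seed=0):
    """Randomly split row indices into in-sample and out-of-sample lists."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    chosen = set(rng.sample(indices, sample_size))
    in_sample = [i for i in indices if i in chosen]
    out_sample = [i for i in indices if i not in chosen]
    return in_sample, out_sample


def cluster_modes(rows, labels):
    """Return the most frequent value of each attribute for every cluster."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        for label, members in groups.items()
    }


def assign(row, modes):
    """Assign one object to the cluster whose mode it mismatches least.

    Assumption: simple matching distance is a stand-in for the
    mutual-information criterion named in the abstract.
    """
    return min(
        modes,
        key=lambda label: sum(a != b for a, b in zip(row, modes[label])),
    )
```

Only the in-sample rows need to be held in memory for clustering; each out-of-sample row is then passed to `assign` one at a time, which is the property the abstract credits for the reduced time and space cost.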
Keywords/Search Tags: Categorical data, Complexity, Accuracy, Mutual Information, Entropy, Dissimilarity measure