
Studies On Clustering Algorithms For Categorical Data

Posted on: 2017-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Huang
Full Text: PDF
GTID: 2308330485478420
Subject: Mathematics
Abstract/Summary:
Cluster analysis is an important technique in data mining that groups data according to the principle of similarity. As a form of unsupervised learning, clustering divides large amounts of data into sub-categories so that objects in the same class are as similar as possible and objects in different classes are as dissimilar as possible. Clustering of numerical data has achieved good results; for example, the classical k-means algorithm is widely used. In practice, however, a great deal of data is categorical. Because categorical data lack the geometric properties of numerical data, they cannot be used directly in numerical computation. Clustering categorical data is therefore comparatively more complex, and it is one of the important and difficult problems for learning algorithms. In recent years, many scholars have explored and improved categorical data clustering. The k-means algorithm cannot be applied to categorical data, and the k-modes algorithm was developed as an extension of it to address this problem.
This thesis studies several problems concerning the k-modes clustering algorithm and compares and analyzes various existing improved k-modes algorithms. The traditional k-modes algorithm defines the dissimilarity between two objects by 0-1 matching. This method neither takes the distribution of the entire data set into account nor considers the influence of relationships between attributes on dissimilarity, which makes the measured differences inaccurate. To address this problem, the thesis presents results in three aspects:
(1) From the perspective of mutual information, the distance between different attribute values is defined on the basis of interdependence redundancy theory, further improving the distance formula proposed by Hong Jia. The improved distance consists of an internal distance and an external distance: the internal distance reflects the degree of difference between the attribute values of objects, and the external distance reflects the influence of the other attributes.
(2) The new distance metric based on interdependence redundancy theory is applied to the k-modes algorithm, and the time complexity of the improved algorithm is analyzed. Experimental comparison with k-modes algorithms based on other distance metrics shows that the k-modes algorithm with the interdependence redundancy metric can handle large-scale data effectively and can also improve clustering accuracy.
(3) Starting from the attribute values as a whole, a new dissimilarity measure based on a structural similarity computing model is given and applied to the traditional k-modes algorithm, and the time complexity of the improved algorithm is analyzed. This measure considers not only the differences between the attribute values themselves but also their conditions with respect to other attributes. Experimental results show that, compared with the traditional k-modes algorithm and the Ahmad algorithm, the k-modes algorithm based on the new dissimilarity measure identifies clusters well and effectively improves accuracy.
This study enriches the dissimilarity measures available for categorical data and, to some extent, provides a new approach to categorical data clustering.
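
For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal Python sketch of traditional k-modes with 0-1 matching dissimilarity. It is an illustration only, not the thesis's code: the function names, the toy data, and the choice to pass the initial modes explicitly are all assumptions made here.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """0-1 matching: number of attributes on which two objects differ."""
    return sum(a != b for a, b in zip(x, y))


def column_modes(rows):
    """Most frequent value of each attribute (ties go to the value seen first)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=100):
    """Baseline k-modes. `init_modes` is a hypothetical parameter: the caller
    supplies the starting modes so that the sketch stays deterministic."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster of its nearest mode.
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:  # no object changed cluster: converged
            break
        labels = new_labels
        # Update step: recompute each mode attribute by attribute.
        for j in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster became empty
                modes[j] = column_modes(members)
    return labels, modes


# Toy data set (made up for illustration): colour, size, shape.
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "flat"),
    ("blue", "huge", "square"),
]
labels, modes = k_modes(data, init_modes=[data[0], data[3]])
```

Note that `matching_dissimilarity` treats every mismatch as equally large, regardless of how often the values occur or how they co-occur with other attributes. That is the limitation the improved measures in the abstract set out to address.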
Keywords/Search Tags: clustering analysis, categorical data, dissimilarity measure, k-modes algorithm, interdependence redundancy theory
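
The abstract does not state the distance formula built on interdependence redundancy theory. As background only, the sketch below computes the interdependence redundancy between two categorical attributes using a definition commonly given in the literature, R(A, B) = I(A; B) / H(A, B), where I is mutual information and H is joint entropy. Whether the thesis uses exactly this form, and how it derives the internal and external distances from it, is an assumption not confirmed by the abstract. All names here are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, n):
    """Shannon entropy (in bits) of a distribution given by raw counts."""
    return -sum((c / n) * log2(c / n) for c in counts)


def interdependence_redundancy(col_a, col_b):
    """R(A, B) = I(A; B) / H(A, B) for two equally long attribute columns.

    Assumed definition; returns 0.0 when the joint entropy is zero
    (both attributes constant), which is a convention chosen here.
    """
    n = len(col_a)
    h_a = entropy(Counter(col_a).values(), n)
    h_b = entropy(Counter(col_b).values(), n)
    h_ab = entropy(Counter(zip(col_a, col_b)).values(), n)
    if h_ab == 0:
        return 0.0
    mutual_information = h_a + h_b - h_ab
    return mutual_information / h_ab
```

Under this definition R lies between 0 and 1: it is 0 when the two attributes are statistically independent in the sample and 1 when each attribute fully determines the other.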