
The Research On Clustering Algorithm For Mixed Numeric And Categorical Values Based Partitioning Methods

Posted on: 2011-08-31 | Degree: Master | Type: Thesis
Country: China | Candidate: W Chen | Full Text: PDF
GTID: 2178360308469042 | Subject: Software engineering
Abstract/Summary:
Cluster analysis is an important data mining technique and an active topic in data mining research. Among the data types to be clustered, data with mixed numeric and categorical values is the most common. The values of its categorical attributes are finite, unordered, and cannot be compared by magnitude. These characteristics lead to many problems. For instance, there is no well-founded dissimilarity measure for describing the differences between samples, and converting categorical values to numeric values usually does not produce effective results. As a result, many clustering algorithms designed for numeric attributes are unsuitable for data with categorical attributes, while the few algorithms that can handle such data still leave room for improvement in both performance and clustering quality. Exploring and improving clustering algorithms for data with mixed numeric and categorical attributes is therefore an important topic in cluster analysis. With the goals of improving accuracy and reducing resource consumption, this thesis analyzes the advantages and disadvantages of clustering algorithms for mixed-attribute data and investigates these problems on the basis of the k-prototypes algorithm. To reduce the influence of choosing the initial cluster centers randomly, it introduces a new selection method based on a linear model, so that the initial centers better reflect the characteristics of the data set. In addition, the existing dissimilarity measures, which express the distance between objects, cannot make effective use of the information in the clusters, especially as the volume of data grows and the types of data in the set become more complex.
To resolve these problems, this thesis improves the dissimilarity formula and then designs a new algorithm for handling data with mixed numeric and categorical values. The contents of this thesis are as follows:
(1) An outline of the background of the subject, both domestic and international.
(2) An analysis and comparison of several primary clustering algorithms, and an introduction to data types and how they are handled in the clustering process.
(3) A description of the k-prototypes algorithm and an analysis of its advantages and disadvantages, followed by an improved method for selecting the initial cluster centers and an improved dissimilarity measure based on it.
(4) A clustering algorithm for mixed numeric and categorical values based on the improved k-prototypes algorithm, together with a simulation experiment platform on an English data set, with the algorithm implemented in Visual C++ and the database built on SQL Server, to validate the overall performance of the improved algorithm. The experimental results indicate that it has better stability and higher accuracy.
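For orientation, the baseline dissimilarity that k-prototypes is commonly described with can be sketched as follows. This is a minimal illustration of the standard measure (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches), not the improved formula proposed in the thesis, which the abstract does not state. The function names, the argument layout, and the weight `gamma` are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def mixed_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """Baseline k-prototypes dissimilarity between an object and a prototype.

    Numeric part: squared Euclidean distance.
    Categorical part: number of attributes whose values differ
    (simple matching), weighted by gamma (hypothetical default 1.0).
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    prototypes: List[Tuple[Sequence[float], Sequence[str]]],
    gamma: float = 1.0,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity."""
    distances = [
        mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(distances)), key=distances.__getitem__)


if __name__ == "__main__":
    # Two hypothetical prototypes: (numeric values, categorical values).
    prototypes = [
        ([0.0, 2.0], ["red", "large"]),
        ([1.0, 2.0], ["blue", "large"]),
    ]
    obj_num, obj_cat = [1.0, 2.0], ["red", "small"]
    print(mixed_dissimilarity(obj_num, obj_cat, *prototypes[0], gamma=0.5))
    print(nearest_prototype(obj_num, obj_cat, prototypes, gamma=0.5))
```

The weight `gamma` balances the two parts: a small value lets numeric attributes dominate the assignment, while a large value lets categorical mismatches dominate. The measure looks only at the object and a single prototype, which is the kind of limitation the abstract points to when it says existing dissimilarities do not use the information in the clusters.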
Keywords/Search Tags:Data mining, cluster, data with mixed numeric and categorical values, k-prototypes algorithm