
Research On Clustering Algorithm Of Mixed Data

Posted on: 2016-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: C K Qian
Full Text: PDF
GTID: 2308330464469345
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, we have entered an information age. Over the past decades, the generation, organization, and exchange of information have undergone revolutionary change, and large amounts of data have accumulated in every walk of life. Nevertheless, the value extracted from these data has not grown in step with their scale. Discovering knowledge from massive data has therefore become an urgent problem, and data mining has attracted extensive attention worldwide. Clustering, an active research topic in data mining, is widely used in practice.

Most clustering algorithms are designed for a single attribute type. However, much research shows that many data sets contain attributes of several types, and traditional clustering algorithms cannot handle such mixed-type data sets. How to cluster mixed-type data has therefore become an active issue in data clustering. This thesis studies the clustering of data with mixed attributes, and its main work is as follows:

1. The thesis introduces the background and state of the art of data mining and presents its trends, tasks, and languages. It then gives an overview of mixed-type data and clustering algorithms, focusing on similarity measures and classical clustering algorithms, and summarizes existing work on mixed-type data clustering.

2. A new dissimilarity measure is proposed, and graph connectivity is applied in a new clustering algorithm, CADFSC. CADFSC exploits the sensitivity of K-Prototypes to its initial centers to generate a large number of pre-clusters, and then applies combining or pruning operations among these pre-clusters. The iteration stops when the termination conditions are met. Simulation experiments show that CADFSC has advantages over K-Prototypes and three other clustering algorithms. Several parameters of CADFSC are also discussed, and recommended values are provided.

3. The affinity propagation (AP) algorithm is extended to cluster data sets with mixed attributes. A new distance formula is proposed and applied to the AP clustering algorithm, yielding APDA. APDA involves no virtual cluster centers, which would otherwise lead to empty clusters. The new algorithm also incorporates the overall diversity of the data set into the distance, which yields better clustering results. Measured by clustering entropy and execution time, APDA performs better than two other clustering algorithms.
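For orientation, the two generic ingredients the abstract refers to can be sketched in a few lines of Python. This is not the thesis's method: the new dissimilarity measure of CADFSC and the distance formula of APDA are not given in the abstract. The first function shows the standard K-Prototypes dissimilarity (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches), which is the baseline the thesis compares against. The second shows one common definition of clustering entropy, which may differ from the one used in the thesis. The function names and the `gamma` default are assumptions made for illustration.

```python
import math
from collections import Counter


def kprototypes_dissimilarity(x_num, y_num, x_cat, y_cat, gamma=1.0):
    """Standard K-Prototypes dissimilarity between two mixed-type records.

    x_num, y_num: numeric attribute values of the two records
    x_cat, y_cat: categorical attribute values of the two records
    gamma: weight of the categorical part (illustrative default)
    """
    # Numeric part: squared Euclidean distance.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    # Categorical part: number of attributes whose values differ.
    categorical_part = sum(1 for a, b in zip(x_cat, y_cat) if a != b)
    return numeric_part + gamma * categorical_part


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted average of per-cluster class entropy (lower is better).

    This is one common definition, assumed here for illustration.
    """
    n = len(cluster_ids)
    # Group the true class labels by assigned cluster.
    members = {}
    for cid, label in zip(cluster_ids, class_labels):
        members.setdefault(cid, []).append(label)
    total = 0.0
    for labels in members.values():
        size = len(labels)
        counts = Counter(labels)
        # Shannon entropy (base 2) of the class distribution in this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total
```

Under this definition, a clustering whose clusters each contain a single class has entropy 0, while clusters that mix two classes evenly have entropy 1.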
Keywords/Search Tags: mixed data, clustering, dimensional frequency, attribute distance, affinity propagation, data mining