
Algorithm Study On Clustering

Posted on: 2006-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: T Dai
Full Text: PDF
GTID: 2168360155474112
Subject: Software engineering
Abstract/Summary:
Clustering analysis is one of the main functions of data mining and knowledge discovery: it groups a data set into classes according to the data's natural structure and gives a characteristic description of every class. When the number of clusters is known, probability-based clustering algorithms are often used to partition the data into classes, making the similarity as large as possible within a class and as small as possible between classes. However, probability-based clustering algorithms do not directly answer the question of how many clusters a given data set contains. In this thesis, after analyzing previous methods for determining the number of clusters in probability-based clustering, we introduce an improved Monte Carlo Cross-Validation algorithm (iMCCV) that attempts to solve the posterior probabilities spread problem, which the Monte Carlo Cross-Validation algorithm cannot resolve. Furthermore, we present a hybrid approach that determines the number of clusters by combining the iMCCV algorithm with parallel coordinates visualization.

In general, there are three kinds of data sources: demographic data, individual behavioral data, and psychographic or attitudinal data. Of these, individual behavioral data predicts future behavior more effectively than the other types. The concept of an individual includes, in a broad sense, humans, animals, organisms, organizations, natural phenomena, mechanical systems, and so on. Examples of this type of data include supermarket data, credit card data, telephone records, and clinical records. Moreover, all business-related data might be viewed as data of this type. Such data are non-vector in nature and may vary in size from individual to individual.

However, traditional clustering algorithms based on distance or similarity are vector-based; that is, they transform the raw data into fixed-dimension vectors. These methods are unfit for individual behavioral data: grouping the data with them loses a lot of information and makes the clustering inaccurate. New clustering algorithms are therefore needed to partition individual behavioral data.

We present the Fuzzy Gaussian Mixture Model (FuzzyGMM) algorithm for general individual behavioral data and the Dual Gaussian Mixture Model (DualGMM) algorithm for multi-peak individual behavioral data.

Based on this algorithm research, we design and implement a clustering mining prototype, VisMMC, based on parallel coordinates visualization.
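The cluster-number question raised in the abstract can be illustrated with a minimal sketch of plain Monte Carlo cross-validation: the data are repeatedly split at random into training and test halves, a Gaussian mixture is fitted to the training half for each candidate number of components, and the candidate with the highest average held-out log-likelihood is chosen. This is not the thesis's iMCCV algorithm, whose details the abstract does not give. The function names (`fit_gmm_1d`, `mccv_select_k`), the restriction to one-dimensional data, the quantile-based initialisation, and all default parameters are assumptions made only for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_gmm_1d(data, k, n_iter=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Initialisation (an assumption of this sketch): means at evenly spaced
    quantiles, shared variance equal to the overall variance divided by k^2.
    """
    n = len(data)
    ordered = sorted(data)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    spread = sum((x - centre) ** 2 for x in data) / n
    variances = [max(spread / k ** 2, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            norm = logsumexp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means and variances (variance floored).
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                min_var)
    return weights, means, variances


def mean_log_likelihood(data, model):
    """Average per-point log-likelihood of data under a fitted mixture."""
    weights, means, variances = model
    total = 0.0
    for x in data:
        total += logsumexp([math.log(w) + log_normal_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)])
    return total / len(data)


def mccv_select_k(data, k_values, n_splits=20, test_fraction=0.5, seed=0):
    """Pick the k with the highest held-out log-likelihood averaged over splits."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_fraction)
    scores = {k: 0.0 for k in k_values}
    for _ in range(n_splits):
        shuffled = list(data)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        for k in k_values:
            model = fit_gmm_1d(train, k)
            scores[k] += mean_log_likelihood(test, model) / n_splits
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Calling `mccv_select_k(data, [1, 2, 3])` on a list of floats returns the selected number of components together with the averaged held-out score for each candidate. Scoring on held-out data rather than training data is the point of the design: training likelihood never decreases as components are added, so it cannot by itself indicate when to stop.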
Keywords/Search Tags: data mining, clustering, probability-based clustering, clustering number, individual behavioral data, visualization