Algorithm Study On Clustering

Posted on:2006-03-29

Degree:Master

Type:Thesis

Country:China

Candidate:T Dai

Full Text:PDF

GTID:2168360155474112

Subject:Software engineering

Abstract/Summary:

Cluster analysis is one of the main functions of data mining and knowledge discovery: it groups a data set into classes by nature and gives a characteristic description of every class. Assuming the number of clusters is known, we often use probability-based clustering algorithms to partition the data into classes, making the similarity as large as possible within a class and as small as possible between classes. However, probability-based clustering algorithms do not directly answer the question of how many clusters a given data set contains. In this thesis, after analyzing previous methods for determining the number of clusters in probability-based clustering, we introduce an improved Monte Carlo Cross-Validation algorithm (iMCCV) that attempts to solve the posterior-probability spread problem, which the Monte Carlo Cross-Validation algorithm cannot resolve. Furthermore, we present a hybrid approach that determines the number of clusters by combining the iMCCV algorithm with parallel coordinates visualization.

In general, there are three types of data source: demographic data, individual behavioral data, and psychographic or attitudinal data. Of these, individual behavioral data predicts future behavior more effectively than the other types. The concept of an individual includes, in a broad sense, humans, animals, organisms, organizations, natural phenomena, mechanical systems, and so on. Examples of this type of data include supermarket data, credit card data, telephone records, and clinical records. Moreover, all business-related data might be viewed as data of this type. Such data are non-vector in nature and may vary in size from individual to individual.

However, traditional clustering algorithms based on distance or similarity are vector-based; that is, they transform raw data into fixed-dimension vectors. These methods are unsuited to individual behavioral data: using them to group such data loses a great deal of information and makes the clustering inaccurate. New clustering algorithms are therefore needed to partition individual behavioral data.

We present the Fuzzy Gaussian Mixture Model (FuzzyGMM) algorithm for general individual behavioral data and the Dual Gaussian Mixture Model (DualGMM) algorithm for multi-peak individual behavioral data.

Based on this algorithm research, we design and implement VisMMC, a prototype clustering-mining system based on parallel coordinates visualization.
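
The abstract does not describe how iMCCV works, so the following is only a minimal sketch of plain Monte Carlo cross-validation for choosing the number of mixture components, not the thesis's improved algorithm. It assumes one-dimensional data, a Gaussian mixture fitted by basic EM, and a quantile-based initialization; the function names (`fit_gmm_1d`, `avg_log_likelihood`, `mccv_select_k`) and all parameter defaults are hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k, iters=50):
    """Fit a k-component 1-D Gaussian mixture by EM (assumes len(data) >= k)."""
    pts = sorted(data)
    n = len(pts)
    weights, means, variances = [], [], []
    # Assumed initialization: split the sorted data into k contiguous chunks.
    for j in range(k):
        chunk = pts[j * n // k:(j + 1) * n // k]
        m = sum(chunk) / len(chunk)
        v = sum((x - m) ** 2 for x in chunk) / len(chunk)
        weights.append(len(chunk) / n)
        means.append(m)
        variances.append(max(v, 1e-6))
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in pts:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, pts)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, pts)) / nj
            variances[j] = max(var, 1e-6)
    return weights, means, variances


def avg_log_likelihood(data, model):
    """Mean log-likelihood of the data under a fitted mixture."""
    weights, means, variances = model
    total = 0.0
    for x in data:
        p = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(max(p, 1e-300))
    return total / len(data)


def mccv_select_k(data, k_values, n_splits=20, test_fraction=0.5, seed=0):
    """Average held-out log-likelihood over random splits; return the best k."""
    rng = random.Random(seed)
    scores = {k: 0.0 for k in k_values}
    n_test = int(len(data) * test_fraction)
    for _ in range(n_splits):
        shuffled = list(data)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        for k in k_values:
            model = fit_gmm_1d(train, k)
            scores[k] += avg_log_likelihood(test, model) / n_splits
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [rng.gauss(-5, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
    print(mccv_select_k(sample, [1, 2, 3, 4]))
```

Scoring each candidate on held-out data rather than on the training data penalizes mixtures with too many components, which is the basic idea behind using cross-validation to pick the number of clusters.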

Keywords/Search Tags:

data mining, clustering, probability-based clustering, clustering number, individual behavioral data, visualization

Related items

1	Research And Implementation Of Clustering Algorithm For Multidimensional Data Sets
2	Design And Implement Of Web Document Clustering System
3	Research On Clustering Algorithms In Traffic Domain
4	Research On Data Streams Clustering Methods
5	Technology Research, Data Mining Based On Fuzzy Clustering
6	No Default Categories For Large Amount Of Data Clustering Algorithm Research
7	Clustering Algorithm And Analysis Of Customer Loyalty
8	The Research On The Method To Measure The Validity And To Abstract Knowledge Of Clustering
9	The Research And Development Of The Visualization Clustering System Oriented To PDM
10	Research On Dynamic Clustering And Incremental In Data Mining