The Study And Application Of Some Issues For Cluster Analysis

Posted on: 2010-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Hui
Full Text: PDF
GTID: 2120360275485568
Subject: Applied Mathematics
Abstract/Summary:
Clustering is a fundamental tool for knowledge discovery, machine learning, and data mining. Unlike traditional classification methods, it divides unlabeled data sets into clusters according to the similarity of the data. A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. Cluster analysis is therefore an unsupervised learning process. Fuzzy clustering and the Gaussian mixture model are two of the most widely used clustering methods. This thesis addresses several basic issues concerning these two types of methods.
Chapter 1 gives a comprehensive overview of the main cluster analysis methods in the literature and analyzes the advantages and shortcomings of partitioning, hierarchical, density-based, grid-based, and model-based methods.
Chapter 2 discusses fuzzy c-means (FCM) and possibilistic c-means (PCM) clustering and their advantages and disadvantages. First, we use theoretical analysis and numerical experiments to investigate the strengths and weaknesses of the FCM and PCM algorithms. We then study the fuzzy clustering algorithm proposed by J S Zhang et al., which combines FCM and PCM. Numerical experiments show that this algorithm effectively exploits the advantages of FCM and PCM while overcoming their respective shortcomings, and that its clustering results are better than those of FCM or PCM alone. In the last part of the chapter, numerical experiments show that FCSS cannot effectively cluster concentric spherical shell data to which noise has been added, and the causes of this behavior are analyzed theoretically. Because FCSS relies on a gradient method and an alternating optimization strategy, it easily becomes trapped in a local optimum, which degrades the clustering result.
We therefore propose using a genetic algorithm to search for the optimal solution of the FCSS objective function. To accelerate the convergence of the genetic algorithm, we combine it with the FCSS algorithm, which yields the GA-FCSS algorithm. Extensive numerical experiments show that GA-FCSS is effective: it separates shell data (including concentric spherical shells) contaminated by various kinds of noise, the estimated shell centers and radii are close to the true values, and the classification of the data points is almost entirely correct.
Chapter 3 discusses clustering algorithms based on statistical models, focusing on the Gaussian mixture model, a semiparametric clustering method that is currently among the more practical ones. First, we relate Gaussian mixtures to the clustering problem and derive the EM algorithm for maximum likelihood estimation of the model parameters. Numerical examples show that the EM algorithm is effective for solid ellipsoidal or spherical classes of data. Finally, based on the Gaussian mixture model, we study the cluster validity problem, that is, determining how many clusters the data contain, which corresponds to the number of normal components in the mixture model. We mainly study the MML-EM algorithm, which is based on the minimum message length (MML) criterion and handles model selection and parameter estimation for the Gaussian mixture model simultaneously.
Numerical experiments show that the MML-EM algorithm selects the correct number of clusters with a high success rate when the number of clusters is initialized with an integer close to the true value, although the estimated cluster prototypes may deviate from the true ones. When the initial number of clusters is far from the true value, the success rate drops rapidly and the algorithm tends to overestimate the number of clusters. We re-examine the MML criterion theoretically, identify the cause of this behavior, and propose an improved MML-EM algorithm. Simulation experiments show that the improved algorithm, called IMML-EM, not only inherits all the advantages of the original MML-EM algorithm, such as robustness to the initial parameter values (other than the number of components), but also, to some extent, avoids its main drawback: the tendency to overestimate the number of components. In particular, the improved algorithm estimates the number of clusters and the parameters of the mixture model very accurately even when started from a larger initial number of clusters and random model parameters. This property is especially useful in practice: when nothing is known about the true number of clusters, the search has to start from a large integer so as not to miss it.
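As an illustration of the EM iteration for a Gaussian mixture mentioned in the abstract, the following is a minimal sketch in Python. It is plain maximum-likelihood EM for a one-dimensional mixture with a fixed number of components, not the MML-EM or IMML-EM algorithm of the thesis. The function names (`gaussian_pdf`, `em_gmm_1d`), the variance floor, the toy data, and the initial values are all assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of the 1-D normal distribution N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50, var_floor=1e-6):
    """Plain EM for a 1-D Gaussian mixture with a fixed number of components.

    Illustrative only: the number of components is never changed, so this
    does not perform the model selection that MML-EM is described as doing.
    Assumes every point has nonzero density under at least one component.
    """
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(data)
    for _ in range(n_iter):
        # E-step: resp[i][j] is the posterior probability that data[i]
        # was generated by component j under the current parameters.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weight, mean, and variance of each component
        # from the responsibility-weighted data.
        for j in range(k):
            nj = sum(resp[i][j] for i in range(n))
            weights[j] = nj / n
            means[j] = sum(resp[i][j] * data[i] for i in range(n)) / nj
            var = sum(resp[i][j] * (data[i] - means[j]) ** 2 for i in range(n)) / nj
            # Hypothetical safeguard: keep variances away from zero.
            variances[j] = max(var, var_floor)
    return means, variances, weights


# Toy data (made up for this sketch): two well-separated groups.
data = [0.9, 1.0, 1.1, 1.2, 0.8, 4.9, 5.0, 5.1, 5.2, 4.8]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print([round(m, 3) for m in means], [round(w, 3) for w in weights])
```

On this toy data the two estimated means settle near the centers of the two groups and the weights near one half each. With a fixed number of components the sketch cannot detect a wrong initial count, which is the model selection problem the MML criterion is meant to address.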
Keywords/Search Tags:fuzzy c-means clustering, possibilistic c-means clustering, GA-FCSS algorithm, Gaussian mixture model, EM algorithm, MML-EM algorithm