Font Size: a A A

Research On Key Technology Of Clustering Analysis Optimization

Posted on:2013-09-22Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z H WangFull Text:PDF
GTID:1228330398998912Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
As an important research area of data mining, clustering analysis can effectivelyhelp us to analyze the distribution of data, understand the characteristics of data,determine the interesting classes and find the hidden structure in the data, in order to dofurther analysis and make use of the data. In this dissertation, In order to overcome theexisting shortcomings of some clustering algorithms, solutions combining with particleswarm optimization to solve the problem such as the need to manually set the algorithminitial parameters and how to improve the clustering performance of some clusteringalgorithm are discussed and proposed. The method how to expand and use pairwiseconstraints as priori information in the process of clustering to improve the clusteringaccuracy is studied. According to the characteristics of text data such ashigh-dimensional and sparse, how to improve the performance of text clustering isstudied. Research work in this dissertation has certain theoretical research value andreality significance of application. This dissertation specifically covers the followingaspects:1)An improved particle swarm optimization(PSO) based K-Means algorithm witha simplified particle encoding rule was proposed. It can effectively overcome theshortcoming such as existing PSO based K-Means algorithm always calculates slowlyand the clustering results are unstable especially when the samples have highdimensions and the range of each dimension value varies widely, it will cause the searchspace of particles to be large and affects the convergence rate and clustering effect in thelimited time iterative search. When some dimensions take values which do not fit withthe real situation, it will lead to empty clusters. When the data set is large, a two-stepshybrid algorithm which combines the advantages of hierarchical clustering, partitioningclustering and PSO was presented. High quality sub-clusters were generated byhierarchical agglomerative clustering in the first step. Then these sub-clusters were usedas the search space of candidate centroids of PSO K-means. Chaotic idea wasintroduced to keep the diversity of particle swarm to avoid premature. Experimentalresults on several UCI data sets and text data sets show that algorithm overcomes theproblems such as sensitive to initial cluster centroid and converges fast. It caneffectively restrain premature phenomenon and the quality of the clustering results isimproved significantly.2)A solution which uses PSO to automatic search for the appropriate initialparameters and obtains the corresponding clustering result was presented. It solves the problems of some clustering algorithms that they need to manually set the initialparameters and improper choice of parameters will seriously affect the clusteringperformance. A new method combining PSO and Fuzzy C-Means(FCM) is presented.By a simple and effective particle encoding method, the best initial centroids and fuzzyweighting exponent are both searched in the process of PSO. It solves the problem ofFCM which is sensitive to the initial centroids and the choice of fuzzy weightingexponent. Density-based clustering algorithm can find clusters of arbitrary shape, butneed to manually determine the parameters MinPts and Eps. PSO and DBSCANalgorithm are combined. By searching the Eps value with a fixed value of MinPts, thealgorithm automatically finds the K clusters according to a fitness function calculatedby the difference between the number of cluster sets in the clustering result and theinput expected number of cluster sets and eventually get the best Eps value and thecorresponding K clusters. It solves the problem of DBSCAN algorithm which is hard todetermine the input parameter of Eps and MinPts.A framework for the validation ofclustering validity indexes based on particle swarm optimization based clustering isdesigned on the basis of research on PSO and clustering validity indexes. The initialcluster centroids and the possible number of clusters are both encoded, then clustervalidity indexes are used as fitness function of PSO. It can automaticly determine thebest partitioning number of dataset. Also it can be applied to verify the performance ofdifferent clustering validity indexes.A number of commonly used cluster validityindexes such as Sil, DB and IGP are tested to compare their performance in clusteringon several UCI data sets.3)In order to utilize pairwise constraints to improve the clustering performance, anovel approach called semi-supervised particle swarm optimization based clusteringalgorithm was proposed。An improved Floyd algorithm was used to expand theoriginal Must-link and Cannot-link pairwise constraints. Then pairwise constraints asprior knowledge are used to adjust the dissimilarity matrix, particle swarm optimizationbased K-means clustering is run based on the modified dissimilarity matrix. Finallystatistical information aboout the pairwise constraints violations in the clustering resultsis considered in calculating the fitness value to guide the PSO search. The experimentalresults on several UCI data sets show that the method proposed can effectively use afew pairwise constrains to improve the clustering accuracy. The overall clustering resultis better than the result of semi-supervised affinity propagation algorithm.4)According to the characteristics of text data such as high-dimensional andsparse, this dissertation puts forward a new kind of initial centroid selection method. It dynamically calculated statistical information in the clustering process from thedocuments which had been partitioned and not partitioned, these statistical informationwill be applied to the next step of partition. It will gradually detect the dense area of theremaining unpartioned documents, if coverage rate of the dense area is less than acertain threshold, this area will be choosen as an initial centroid. After generating thepredetermined number of initial cluster centroid set, the remaining documents wereassigned to their nearest clusters, finally do further optimization of the clustering resultby the criterion function. At present the typical text clustering algorithms always need toset some thresholds manually. The threshold of algorithm proposed are all generated inthe clustering process by dynamic statistics of the partition information of the data sets.It can avoid the blindness of setting threshold by people’s experience. It can welleliminate the influence of the edge points and noise points, it can adapt to the data setswith different class size and unbalanced density distribution. The comparation with thefamous CLUTO clustering tool set shows that the method proposed in this dissertationachieved better clustering accuracy and stronger robustness on different data sets.Research on how to expand the pairwise constraints and make full use of pairwiseconstraints to guide text clustering is presented based on our previous research. Asemi-supervised text clustering algorithm was proposed. First pairwise constraints wereembedded in the document similarity matrix, then pairwise constraints were effectivelyapplied in the process of initial centroids selection, remaining documents partition andclustering result optimization. Experiments on both Chinese and English text data setsshow that the proposed semi-supervised text clustering method can effectively use a fewamount of pairwise constraints to improve clustering performance.
Keywords/Search Tags:Clustering, particle swarm optimization, clustering validity, semi-supervised clustering, text clustering
PDF Full Text Request
Related items