
The Research Of Clustering Ensemble Based On Genetic Algorithm And Co-association Matrix

Posted on: 2019-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: K C Zheng
Full Text: PDF
GTID: 2428330545470236
Subject: Software engineering
Abstract/Summary:
As clustering analysis has developed, research on clustering ensembles has matured, but several problems remain. Because no prior information about the class structure of the data is available, the quality of the base clusterings depends heavily on the clustering algorithms used to produce them. The co-association matrix used by most consensus (consistent integration) functions considers only the probability that a pair of data points appears in the same cluster, so the information in the base clusterings is underused. The time and space complexity of clustering ensembles is also high.

This thesis studies three aspects of clustering ensembles: the consensus function, the co-association matrix, and the parallel implementation of the algorithm. Using an improved genetic algorithm as the consensus function ensures the accuracy and diversity of the base clusterings. The improved co-association matrix provides more information to the consensus function. Together, the two improve the clustering results. Finally, parallelizing the algorithm with the MapReduce framework on the Hadoop platform improves its efficiency. The parallel algorithm is also applied to medical research data sets, where its results offer guidance for early cancer detection and prevention. The specific research results are as follows:

(1) We propose a clustering ensemble algorithm based on a genetic algorithm (CEGA). To meet the requirements of high accuracy and high diversity in the base clusterings, the genetic algorithm is adopted as the consensus function of the clustering ensemble. The fitness function is designed according to the objective of the clustering ensemble, and the selection operator is designed around the largest number of overlapping elements among the base clusterings. The influence of the fitness function and the selection operator on CEGA is analyzed. CEGA is also compared with other common clustering ensemble algorithms on several data sets to demonstrate its superiority.

(2) We propose a clustering ensemble algorithm based on a genetic algorithm and a co-association matrix (CEGACM). Building on CEGA, this algorithm improves the co-association matrix and applies it to the fitness function of the genetic algorithm. The improved co-association matrix recalculates the probability that a pair of data points appears in the same cluster, taking into account the probability distribution and the probability that the pair appears in different clusters. This makes fuller use of the information provided by the base clusterings and improves the effectiveness of the algorithm. We also analyze the parameter settings of CEGACM experimentally and compare it with other common clustering ensemble algorithms on several data sets; its experimental results are clearly better than theirs.

(3) CEGA and CEGACM are implemented on the Hadoop platform using the MapReduce model. Exploiting the parallelizable nature of both clustering ensembles and genetic algorithms, we add two MapReduce processes to each proposed algorithm: one generates the base clusterings in parallel, and the other parallelizes the consensus function. Like the base clustering generation stage, the parallel consensus stage defines a map function and a reduce function, and it additionally uses a Combine operation to reduce inter-node communication and running time, improving the efficiency of the clustering ensemble. The parallel CEGACM is also applied to medical data sets for cancer classification, providing guidance for early cancer detection and prevention.
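
For readers unfamiliar with the term, the following is a minimal sketch of the classical co-association matrix that the abstract takes as its starting point: entry (i, j) is the fraction of base clusterings that place points i and j in the same cluster. It is not the improved matrix used in CEGACM, whose exact formula the abstract does not give. The function name `co_association` and the sample label vectors are illustrative assumptions.

```python
def co_association(base_clusterings):
    """Classical co-association matrix (illustrative, not the thesis's improved version).

    base_clusterings: list of label sequences, one per base clustering,
    all of the same length n. Returns an n x n list of lists where
    entry [i][j] is the fraction of base clusterings in which points
    i and j share a cluster label.
    """
    if not base_clusterings:
        raise ValueError("need at least one base clustering")
    n = len(base_clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        if len(labels) != n:
            raise ValueError("all base clusterings must label the same points")
        for i in range(n):
            for j in range(n):
                # Only same-cluster agreement is counted; pairs placed in
                # different clusters contribute nothing to the entry.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(base_clusterings)
    return [[c / m for c in row] for row in counts]


# Hypothetical example: three base clusterings of four points.
example = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first, with labels swapped
    [0, 0, 0, 1],
]
matrix = co_association(example)
```

Because only label equality within each clustering is compared, the matrix is unaffected by how each base clustering names its clusters. The sketch also shows the limitation the abstract points out: evidence that a pair was placed in different clusters is not used.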
Keywords/Search Tags:clustering ensemble, genetic algorithm, co-association matrix, MapReduce, parallel computing