
Study On Three-way Decisions Clustering Ensemble Based On Spark

Posted on: 2019-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chen
Full Text: PDF
GTID: 2428330590465723
Subject: Computer Science and Technology
Abstract/Summary:
Clustering analysis is used in many fields, including data mining, machine learning, and pattern recognition. It is an unsupervised learning method for unlabeled data. The advantage of a clustering ensemble is that it combines multiple cluster members to provide better clustering results and to improve the quality and robustness of the algorithm. The three-way clustering representation intuitively describes which objects belong to a certain cluster and which objects do not. Spark, a framework for rapid data analysis in the era of big data, has developed rapidly since its birth in 2009 and has become a mainstream big data processing tool. Therefore, to solve the clustering problem for large-scale uncertain data, this thesis proposes a three-way clustering ensemble method based on Spark.

To process large-scale data in parallel, a three-way ensemble model based on Spark is proposed. It consists mainly of three parts. First, a distributed affinity propagation algorithm is proposed as the cluster member. The advantage of the existing affinity propagation algorithm, abbreviated AP, is that it does not require the number of clusters to be specified in advance and that it has good stability. To allow the AP algorithm to run in parallel on a cluster, the thesis improves the RDD transformations of the similarity matrix, availability matrix, and responsibility matrix in the AP algorithm. Then, the OVERLAP matrix is constructed from the clustering results of the cluster members so that the objects are assigned to the corresponding clusters. In addition, the thesis builds the consensus function by combining three-way decisions with a majority voting strategy. Finally, several UCI data sets are used to verify the model and to test the performance of the distributed algorithm. The test results show that the algorithm has good speedup and sizeup.

To reduce the time complexity, a consensus method based on Spark is presented that combines hypergraphs, cores, and three-way decisions. The concept of a core was introduced in previous work by the project team; it reflects the minimal granularity distribution structure agreed on by all ensemble members. Computation based on cores is much cheaper than computation based on the original objects. The thesis divides the objects into three types (large cores, small cores, and non-core data objects) and proposes a corresponding processing strategy for each. Because hypergraphs are good at reflecting complex relationships among data, the thesis constructs hypergraph adjacency matrices to implement the consensus clustering. Comparative experiments are performed on 15 UCI datasets and 4 large-scale datasets.
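As a rough illustration of how majority voting and three-way decisions can be combined in a consensus step, the following minimal Python sketch assigns a single object to the core or fringe region of clusters from the votes of the ensemble members. It is not the thesis's consensus function: the function name `three_way_vote`, the thresholds `alpha` and `beta`, and the assumption that cluster labels are already aligned across members are all hypothetical choices made for illustration.

```python
from collections import Counter


def three_way_vote(labels, alpha=0.7, beta=0.3):
    """Decide cluster membership for ONE object from ensemble votes.

    labels: the cluster label each ensemble member gave this object
            (assumed to be aligned across members already).
    alpha, beta: hypothetical thresholds with 0 <= beta < alpha <= 1.

    Returns (core, fringe):
      core   - clusters the object definitely belongs to (vote share >= alpha)
      fringe - clusters the object may belong to (beta < vote share < alpha)
    Clusters with vote share <= beta are treated as "does not belong".
    """
    if not labels:
        raise ValueError("labels must contain at least one vote")
    if not 0 <= beta < alpha <= 1:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")

    total = len(labels)
    core, fringe = set(), set()
    for cluster, votes in Counter(labels).items():
        share = votes / total
        if share >= alpha:
            core.add(cluster)
        elif share > beta:
            fringe.add(cluster)
    return core, fringe


if __name__ == "__main__":
    # Four of five members agree on cluster 0.
    print(three_way_vote([0, 0, 0, 0, 1]))
    # Members are split between clusters 0 and 1.
    print(three_way_vote([0, 0, 1, 1, 1]))
```

With these assumed thresholds, an object on which most members agree lands in the core region of one cluster, while an object on which members are split is deferred to the fringe of several clusters rather than being forced into one.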
Keywords/Search Tags:Large-scale Data, Three-way Decisions, Cluster Ensembles, Spark