Study On Three-way Decisions Clustering Ensemble Based On Spark

Posted on:2019-01-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y Chen

Full Text:PDF

GTID:2428330590465723

Subject:Computer Science and Technology

Abstract/Summary:

Cluster analysis is an unsupervised learning method for unlabeled data and is used in many fields, including data mining, machine learning, and pattern recognition. A clustering ensemble combines multiple cluster members to provide better clustering results, improving the quality and robustness of the algorithm. The three-way clustering representation intuitively describes which objects belong to a given cluster and which objects do not. Spark, a framework for fast data analysis in the era of big data, has developed rapidly since its birth in 2009 and has become a mainstream big data processing tool. Therefore, to solve the clustering problem for large-scale uncertain data, this thesis proposes a three-way clustering ensemble method based on Spark.

To process large-scale data in parallel, a three-way ensemble model based on Spark is proposed. It consists of three main parts. First, a distributed affinity propagation algorithm is proposed to generate the cluster members. The advantage of the existing affinity propagation (AP) algorithm is that it does not require the number of clusters to be specified in advance and that it is stable. To let the AP algorithm run in parallel on a cluster, the thesis improves the RDD transformations of the similarity matrix, availability matrix, and responsibility matrix in the AP algorithm. Second, the OVERLAP matrix is constructed from the clustering results of the cluster members so that objects are assigned to the corresponding clusters. Third, the thesis builds the consensus function by combining three-way decisions with a majority voting strategy. Finally, a number of UCI data sets are used to verify the model and to test the performance of the distributed algorithm; the results show that the algorithm has good speedup and sizeup.

To reduce the time complexity, a Spark-based consensus method is presented that combines hypergraphs, cores, and three-way decisions. The concept of a core was introduced in previous work by the project team; it reflects the minimal granularity distribution structure agreed on by all the ensemble members. Computation based on cores is much cheaper than computation based on the original objects. The thesis divides the objects into three types (large cores, small cores, and non-core data objects) and proposes a corresponding processing strategy for each. Because hypergraphs are good at reflecting complex relationships among data, the thesis constructs hypergraph adjacency matrices to implement the consensus clustering. Comparative experiments are performed on 15 UCI datasets and 4 large-scale datasets.
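
The two ideas the abstract leans on, a majority-vote consensus with three-way decisions and cores agreed on by all ensemble members, can be illustrated with a small single-machine sketch. This is not the thesis's Spark implementation and does not use its OVERLAP matrix or hypergraph construction. The function names, the thresholds alpha and beta, and the assumption that cluster labels are already aligned across members are all hypothetical choices made for illustration. Objects whose vote share for a cluster reaches alpha are placed in that cluster's core region, objects with a share between beta and alpha go to its fringe region, and the rest are treated as not belonging to it.

```python
from collections import Counter, defaultdict


def three_way_vote(labelings, alpha=0.7, beta=0.3):
    """Majority-vote consensus with a three-way decision per (object, cluster).

    labelings: one label list per ensemble member, all of equal length.
    Assumption: labels are already aligned, so label k means the same
    cluster in every member. alpha and beta are illustrative thresholds.
    """
    n_members = len(labelings)
    n_objects = len(labelings[0])
    core = defaultdict(set)    # objects that definitely belong to a cluster
    fringe = defaultdict(set)  # objects that possibly belong to a cluster
    for i in range(n_objects):
        votes = Counter(member[i] for member in labelings)
        for cluster, count in votes.items():
            share = count / n_members
            if share >= alpha:
                core[cluster].add(i)
            elif share > beta:
                fringe[cluster].add(i)
            # share <= beta: object i is treated as outside this cluster
    return dict(core), dict(fringe)


def agreed_cores(labelings):
    """Group objects that every member places together.

    Two objects fall in the same group only if they share a label in every
    member, so no label alignment across members is needed here.
    """
    groups = defaultdict(list)
    for i, labels in enumerate(zip(*labelings)):
        groups[labels].append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Three hypothetical ensemble members clustering six objects.
    labelings = [
        [0, 0, 0, 1, 1, 1],
        [0, 0, 1, 1, 1, 1],
        [0, 0, 0, 0, 1, 1],
    ]
    core, fringe = three_way_vote(labelings)
    print("core regions:", core)
    print("fringe regions:", fringe)
    print("agreed groups:", agreed_cores(labelings))
```

In this toy input, objects 2 and 3 are the ones the members disagree about, so they end up in fringe regions rather than in any core region. Working on the agreed groups instead of on individual objects is what would reduce the amount of computation, since there are fewer groups than objects.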

Keywords/Search Tags:

Large-scale Data, Three-way Decisions, Cluster Ensembles, Spark

Related items

1	Research And Application Of Clustering Algorithms For Large Scale Data
2	Design And Implementation Of Spectral Clustering Algorithm For Large Scale Data
3	Study On Data Fusion Of The Large Scale Carbon Cycle Model Based On Spark
4	Performance Monitoring And Optimization For Large-scale Data Processing In Cluster
5	Research Of Large-scale Data Mining Technology Based On Spark
6	Fast Analysis Of Large-scale Wafer Inspection Data
7	A New Consensus Method For Cluster Ensembles To Improve Clustering Accuracy And Stability
8	Large-Scale Positive And Unlabeled Learning
9	Research And Implementation Of The Large Scale Cluster Anomaly Detection Technology And Data Masking Technology
10	Research On Large-scale Complex Network Community Detection Algorithm Based On Spark