Research And Application Of Parallelizing Affinity Propagation Clustering Based On Distributed Computing

Posted on:2016-08-15

Degree:Master

Type:Thesis

Country:China

Candidate:X Zhang

Full Text:PDF

GTID:2308330464969393

Subject:Software engineering

Abstract/Summary:

Clustering is a common data mining technique that groups data objects by their characteristics, so that similar objects fall into the same cluster. With the rapid development of the traditional Internet and the mobile Internet, the data generated by businesses and users is growing explosively. When mining such massive data, traditional clustering algorithms are extremely time-consuming and cannot meet the timeliness requirements of data mining. As a result, algorithm optimization and parallelization for massive data processing have become active research topics.

Affinity Propagation (AP) clustering has been widely used in recent years and has produced good clustering results in a number of scenarios. Unlike the K-Means clustering algorithm, it does not require cluster centers to be set in advance: each data object is treated as a potential cluster center, the objects pass messages to one another, and the cluster centers emerge automatically. However, the time and space complexity of the algorithm is high, so the total computation time grows quickly as the amount of data increases. To allow AP clustering to analyze massive data efficiently, we study methods for parallelizing the algorithm so that it can run in parallel in a cluster environment.

Hadoop is an open source distributed computing framework modeled on Google’s MapReduce parallel programming model. It encapsulates the low-level details of parallelization so that developers can focus on the parallelization strategy. To make up for Hadoop's weakness in iterative algorithms, Berkeley proposed Spark, a distributed in-memory computing framework that caches data in RDDs and thereby effectively improves the performance of iterative algorithms. Based on the characteristics of the two computing platforms, we use both to parallelize the AP clustering algorithm, and we analyze the performance of the parallel algorithm and how it differs between the two platforms. Intrusion detection experiments on the massive KDD99 data set show that the parallelized AP clustering algorithm achieves good speedup and scalability on both platforms. Moreover, thanks to the optimizations of the Spark in-memory computing framework, the AP clustering algorithm runs more efficiently on Spark, which makes it more suitable for clustering analysis of massive data.

Finally, we design and develop a cloud-based service platform for cluster analysis. The platform integrates the parallel clustering algorithms seamlessly with the Hadoop and Spark platforms and exposes a simple, practical RESTful interface to external callers. We also developed an SDK for the cloud-based clustering services for local calls, so that developers can call the parallelized clustering algorithms directly without understanding the underlying cloud computing details.
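
The message passing that the abstract attributes to AP clustering (every data object is a candidate center, and the centers emerge from the exchanged messages) can be illustrated with a small serial sketch. This is not the thesis's Hadoop or Spark implementation, which the abstract does not detail. It uses the commonly published responsibility and availability update rules for AP, and the function name `affinity_propagation`, the damping factor, the iteration count, the negative squared distance similarity, the preference value of -10 and the toy points are all assumptions made for illustration.

```python
def affinity_propagation(S, damping=0.5, iterations=100):
    """Serial AP sketch on a square similarity matrix S (list of lists).

    S[i][k] is the similarity of point i to candidate center k; the
    diagonal S[k][k] is the "preference" of point k to become a center.
    Returns (exemplars, labels), where labels[i] is the exemplar index of i.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            first = scores[best]
            second = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                target = S[i][k] - (second if k == best else first)
                R[i][k] = damping * R[i][k] + (1 - damping) * target

        # a(k,k) = sum over i' != k of max(0, r(i',k))
        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(positive) - positive[k]
            for i in range(n):
                if i == k:
                    target = support
                else:
                    target = min(0.0, R[k][k] + support - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * target

    # A point is an exemplar (cluster center) when r(k,k) + a(k,k) > 0.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], []
    # Each non-exemplar point joins the exemplar it is most similar to.
    labels = [
        i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
        for i in range(n)
    ]
    return exemplars, labels


# Toy 1-D data (assumed): two well-separated groups of three points.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
preference = -10.0  # assumed value; lower preferences yield fewer clusters
S = [
    [preference if i == k else -(x - y) ** 2 for k, y in enumerate(points)]
    for i, x in enumerate(points)
]
exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

On this toy input the middle point of each group should emerge as its center, without the number of clusters being supplied. The sketch also makes the cost mentioned in the abstract visible: both message matrices have n × n entries and every iteration visits all of them. Each row of the responsibility update and each column of the availability update depends only on its own row or column of the previous messages, which is one plausible reason the algorithm lends itself to partitioned, parallel execution.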

Keywords/Search Tags:

Affinity Propagation, Parallel Method, Distributed Computing, Hadoop, Spark

Related items

1	Research And Application Of Clustering Parallel Strategy For Affinity Propagation
2	Research And Application Of Affinity Propagation Based On Spark And Its Incremental Algorithm
3	Parallel Clustering Algorithm Based On MapReduce
4	Research And Application Of Distributed Semantic Neighbor Search Algorithm Based On Spark
5	Graph Reachability Distributed Computing And Application Based On Spark
6	Research On Affinity Propagation And Its Application In Image Clustering
7	Design And Implementation Of A Distributed Hybrid Index Structure Based On Spark
8	CT Image Processing And Parallel Spatial Statistics Of Coal And Rock Mass Based On Hadoop
9	Research On Memory Data Management Technology In Spark
10	Research On The Implementation Of Bursty Events Detection Based On Spark