Research And Application Of Parallelizing Affinity Propagation Clustering Based On Distributed Computing

Posted on: 2016-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
GTID: 2308330464969393
Subject: Software engineering
Abstract/Summary:
Clustering is a common data mining technique that groups data objects by their characteristics so that similar objects fall into the same cluster. With the rapid development of the traditional Internet and the mobile Internet, the data generated by businesses and users is growing explosively. On data at this scale, traditional clustering algorithms are extremely time-consuming and cannot meet the timeliness requirements of data mining. Research on algorithm optimization and parallelization has therefore become a hot topic.

Affinity Propagation (AP) clustering has been widely used in recent years and has produced good clustering results in a number of scenarios. Unlike the K-Means clustering algorithm, AP does not require cluster centers to be set in advance: every data object is treated as a potential cluster center, the objects pass messages to one another, and the cluster centers emerge automatically. However, the time and space complexity of the algorithm is high, so the total computation time grows quickly as the amount of data increases. To apply AP clustering efficiently to massive data sets, we study how to parallelize the algorithm so that it can run in parallel in a cluster environment.

Hadoop is an open-source distributed computing framework modeled on Google's MapReduce approach to parallel computation. It encapsulates the low-level details of parallelization so that developers can focus on the parallelization strategy. To compensate for Hadoop's weakness on iterative algorithms, Berkeley proposed Spark, a distributed in-memory computing framework that improves the performance of iterative algorithms by caching data in RDDs.
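The message passing that AP clustering relies on can be illustrated with a short single-machine sketch. The code below is plain Python and is not the parallel implementation from the thesis. The function name `affinity_propagation`, the negative squared Euclidean similarity, and the `preference`, `damping`, and `iterations` parameters are assumptions chosen for illustration. It follows the standard responsibility and availability updates of Affinity Propagation. Each responsibility row depends only on one row of the similarity and availability matrices, and each availability column depends only on one column of the responsibility matrix, which is one natural place to split the work across workers.

```python
def affinity_propagation(points, preference, damping=0.5, iterations=100):
    """Minimal AP sketch: returns (exemplar indices, label per point)."""
    n = len(points)
    # Similarity s(i, k): negative squared Euclidean distance (an assumption).
    s = [[-sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
          for k in range(n)] for i in range(n)]
    for k in range(n):
        s[k][k] = preference  # self-similarity controls how many exemplars emerge
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(iterations):
        # Responsibility update: row i uses only row i of s and a.
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                target = s[i][k] - best_other
                r[i][k] = damping * r[i][k] + (1 - damping) * target
        # Availability update: column k uses only column k of r.
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # positive responsibilities from i' != k
            for i in range(n):
                if i == k:
                    target = total
                else:
                    target = min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * target

    # A point is an exemplar when its self-responsibility plus
    # self-availability is positive.
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [], []
    labels = [i if i in exemplars else max(exemplars, key=lambda e: s[i][e])
              for i in range(n)]
    return exemplars, labels


# Toy data: two well-separated groups of three 2-D points (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
exemplars, labels = affinity_propagation(points, preference=-10.0)
print(exemplars, labels)
```

The naive loops above are cubic in the number of points per iteration and keep three full N×N matrices in memory, which illustrates why the computation becomes expensive as the data grows.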
Based on the characteristics of the two computing platforms, we parallelize the AP clustering algorithm on both, and analyze the performance of the parallel algorithms and the performance difference between the platforms. Intrusion detection experiments on the large KDD99 data set show that the parallelized AP clustering algorithm achieves good speedup and scalability on both platforms. Moreover, owing to the optimizations of Spark's in-memory computing framework, the AP clustering algorithm runs more efficiently on Spark, which makes it better suited to clustering analysis of massive data.

Finally, we design and develop a cloud-based service platform for cluster analysis. It integrates the parallelized clustering algorithms seamlessly with the Hadoop and Spark platforms and exposes a simple, practical RESTful interface to external callers. We also developed an SDK for the cloud-based clustering services for local calls, so that developers can invoke the parallelized clustering algorithms directly without understanding the underlying cloud computing details.
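Speedup, as used for the experiments above, is conventionally the single-node running time divided by the running time on several nodes, and parallel efficiency is the speedup divided by the node count. The helper names `speedup` and `efficiency` below are hypothetical, and the timings in the usage lines are made-up placeholders rather than results from the thesis.

```python
def speedup(t_single, t_parallel):
    """Ratio of single-node running time to multi-node running time."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, nodes):
    """Speedup per node; 1.0 would mean perfectly linear scaling."""
    return speedup(t_single, t_parallel) / nodes


# Placeholder timings in seconds, for illustration only.
print(speedup(120.0, 40.0))
print(efficiency(120.0, 40.0, nodes=4))
```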
Keywords/Search Tags:Affinity Propagation, Parallel Method, Distributed Computing, Hadoop, Spark