
Optimization And Implementation Of Clustering Algorithms Based On Spark Platform

Posted on: 2017-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: P Cao
Full Text: PDF
GTID: 2308330485457845
Subject: Software engineering
Abstract/Summary:
In the modern information society, data volumes keep growing, and so does the demand for clustering large-scale data sets to extract useful information. Large-scale clustering currently faces two difficulties: first, the memory required exceeds the hardware capacity of a single machine; second, clustering takes so long that efficiency is hard to improve. Optimizing clustering algorithms for large-scale data therefore comes down to reducing the data scale and optimizing the algorithms on distributed platforms. In recent years the distributed computing platform Spark has gained widespread attention. Spark can carry out iterative computation on massive data in memory, which speeds up computation and gives it a clear advantage over other distributed platforms for iterative workloads.

This thesis focuses on the optimization and implementation of specific clustering algorithms on the Spark platform; in addition, it preprocesses the clustering data to reduce the data scale and improve efficiency without changing the clustering result. Affinity Propagation clustering and Spectral clustering, both proposed in recent years and widely applied, are selected as the algorithms to optimize. The main work of this thesis is as follows:

(1) To address the data scale of clustering algorithms, this thesis introduces a new parameter, called the threshold, for preprocessing similarity data sets. Given the number of clusters, the threshold is calculated from the density of the data in the space. The method eliminates similarity entries below the threshold and retains the effective ones, then optimizes the data structure and generates a sparse matrix. The aim is to reduce the data scale without changing the clustering result.

(2) For Affinity Propagation clustering, this thesis introduces a partitioning method for the Affinity Propagation algorithm on the Spark platform. The method uses a specific data structure to index the matrix rows on Spark. In each iteration, the responsibility matrix is computed in parallel over row partitions and the results are written by column; the availability matrix is computed in parallel over column partitions and the results are written by row. After the iterations finish, the clustering results are generated. This method reduces data transmission and improves the efficiency of the algorithm.

(3) For Spectral clustering, this thesis introduces a Lanczos-based Spectral clustering algorithm on the Spark platform. The Lanczos method is used to generate a tridiagonal matrix from the Laplacian matrix in parallel on Spark. This lowers the time complexity, because the tridiagonal matrix is easy to decompose, yielding an n×k eigenvector matrix and thereby reducing the matrix dimension. The parallel Affinity Propagation algorithm is then used in place of the original K-means algorithm to cluster the intermediate results. This method improves the time efficiency of the clustering algorithm.

Experiments show that preprocessing the similarity data sets and optimizing the two algorithms above improve the time efficiency of clustering without losing accuracy. The methods in this thesis help improve the efficiency of data clustering and lay a foundation for future work on improving the performance of other clustering algorithms.
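
To make the threshold preprocessing in (1) and the Lanczos step in (3) more concrete, the following is a minimal single-machine sketch in plain Python, not the thesis's Spark implementation. It rests on several assumptions that the abstract does not confirm: the threshold is passed in as a fixed number rather than derived from data density, the unnormalized Laplacian D - W is used, full reorthogonalization is added for numerical stability, and the function names `sparsify`, `laplacian_matvec`, and `lanczos` are hypothetical. The point it illustrates is that Lanczos only needs matrix-vector products with the sparse Laplacian, which is the kind of operation that can be split across row partitions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sparsify(similarity, threshold):
    """Keep off-diagonal similarities >= threshold as {row: {col: value}}.

    The threshold is a plain argument here (assumption); the thesis derives
    it from the data density and the number of clusters.
    """
    n = len(similarity)
    return {
        i: {j: similarity[i][j] for j in range(n)
            if j != i and similarity[i][j] >= threshold}
        for i in range(n)
    }

def laplacian_matvec(sparse_w, x):
    """Return (D - W) x, touching only the stored entries of W."""
    return [
        sum(w * (x[i] - x[j]) for j, w in sparse_w[i].items())
        for i in range(len(x))
    ]

def lanczos(matvec, start, steps):
    """Lanczos tridiagonalization of a symmetric operator.

    Returns (alphas, betas, basis): the diagonal and off-diagonal of the
    tridiagonal matrix, and the orthonormal Lanczos vectors.
    """
    norm = math.sqrt(dot(start, start))
    basis = [[v / norm for v in start]]
    alphas, betas = [], []
    for step in range(steps):
        w = matvec(basis[-1])
        alphas.append(dot(w, basis[-1]))
        # Full reorthogonalization against every earlier Lanczos vector.
        for prev in basis:
            c = dot(w, prev)
            w = [wi - c * pi for wi, pi in zip(w, prev)]
        beta = math.sqrt(dot(w, w))
        if step == steps - 1 or beta < 1e-12:
            break  # done, or the Krylov subspace is exhausted
        betas.append(beta)
        basis.append([wi / beta for wi in w])
    return alphas, betas, basis

# Toy similarity matrix: neighbours on a 4-node chain are similar (1.0);
# all other pairs are weakly similar (0.1) and fall below the threshold.
similarity = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 1.0, 0.1],
    [0.1, 1.0, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]
sparse_w = sparsify(similarity, threshold=0.5)
alphas, betas, basis = lanczos(
    lambda x: laplacian_matvec(sparse_w, x), [1.0, 0.0, 0.0, 0.0], steps=4)
print(alphas, betas)  # diagonal [1, 2, 2, 1], off-diagonal [1, 1, 1]
```

In a spectral clustering pipeline, the eigenvectors of the small tridiagonal matrix would then be mapped back through the Lanczos vectors to approximate eigenvectors of the Laplacian; that step, and the Spark partitioning itself, are left out of this sketch.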
Keywords/Search Tags: Big Data, Distributed Computing, Clustering, Spark Platform