Research And Implementation Of Large-Scale And Efficient Clustering Algorithm Based On Spark

Posted on: 2019-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: S B Huang
GTID: 2428330545485134
Subject: Computer Science and Technology
Abstract/Summary:
Clustering analysis, also called unsupervised classification, is one of the fundamental techniques in machine learning, data mining and pattern recognition. It is widely used in computer science, economics, medicine, social science and other fields, and has therefore attracted growing attention. The purpose of clustering is to make samples within the same cluster as similar as possible and samples in different clusters as dissimilar as possible, so as to mine potential knowledge from the data.

In recent years, many big data clustering methods based on various distributed platforms have been proposed. However, challenges remain, such as high computation cost, long iteration time and poor clustering quality, all of which limit the efficiency of clustering on large-scale data sets. For example, the K-means algorithm on Spark is intuitive and easy to implement, but its final clustering result is strongly influenced by the selection of the initial cluster centers, and it may converge to a local optimum. Spectral clustering can cluster sample spaces of arbitrary shape, but its main problem is the large computation overhead: it requires both the pairwise similarities between all samples and the eigenvectors of the Laplacian matrix. Co-clustering exploits the duality between feature clustering and sample clustering to find potential patterns, and achieves better clustering results than traditional methods. However, its computational cost is prohibitive for large data sets, which severely limits its scope of application.

To address these problems in existing work, this paper presents efficient big data clustering algorithms based on Spark. The main contents and contributions of this paper are as follows:
(1) We design an efficient and reliable data preprocessing implementation for big data clustering analysis. Compared with existing methods, the proposed method has better performance and scalability and further improves the clustering quality.
(2) We propose a method, named density-aware and auto-select K, to optimize the initial center points. It adaptively initializes the number and positions of the cluster centers according to the data distribution. An optimized strategy is also used in the distance computation.
(3) To reduce the computation overhead of large-scale similarity measurement in spectral clustering, we propose a novel parallel similarity computation approach, named the multi-round iteration technique, which avoids duplicated calculation. In addition, we implement an efficient parallel eigenvector algorithm based on ScaLAPACK, which shortens the time needed to solve the eigenvalue problem and shows good performance.
(4) We propose a novel co-clustering algorithm based on non-negative matrix factorization (NMF), together with its parallel implementation. It requires fewer iterations to converge, obtains promising clustering results, and shows good performance and scalability.
(5) Based on the widely used distributed data-parallel computing platform Spark, we design and implement the fast K-means, parallel spectral clustering and co-clustering algorithms, which ensure promising clustering results and achieve good data and system scalability in experiments on various types of data sets.

The fast K-means algorithm won first prize in the third "National Contest on Application and Innovation of Cloud Computing", hosted by the Science and Technology Development Center of the Ministry of Education, in April 2017.
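The sensitivity of K-means to its initial cluster centers, noted in the abstract, can be illustrated with a minimal sketch. The code below is plain Lloyd-style K-means in pure Python on a hypothetical four-point data set; it is not the Spark implementation or the density-aware initializer described in this work, and the names `sq_dist` and `lloyd` are assumptions made only for this illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def lloyd(points: List[Point], centers: List[Point],
          max_iter: int = 100) -> Tuple[List[Point], float]:
    """Run Lloyd's K-means from the given initial centers.

    Returns the final centers and the sum of squared errors (SSE).
    """
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


# Hypothetical toy data: the corners of a wide 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Initial centers on the left and right sides: left/right split.
good_centers, good_sse = lloyd(points, [(0.0, 0.0), (4.0, 0.0)])

# Initial centers on the bottom and top edges: top/bottom split.
bad_centers, bad_sse = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_centers, good_sse)
print(bad_centers, bad_sse)
```

Both runs converge, but the second one stops at a partition with a much larger SSE than the first. Behaviour of this kind is what initialization schemes that adapt to the data distribution aim to avoid.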
Keywords/Search Tags: Big Data, Clustering, Similarity Measurement, Parallelization