Research And Implementation Of Large-Scale And Efficient Clustering Algorithm Based On Spark

Posted on:2019-04-04

Degree:Master

Type:Thesis

Country:China

Candidate:S B Huang

Full Text:PDF

GTID:2428330545485134

Subject:Computer Science and Technology

Abstract/Summary:

Clustering analysis, also called unsupervised classification, is one of the fundamental techniques in machine learning, data mining and pattern recognition. It is widely used in computer science, economics, medicine, social science and other fields, and has therefore attracted growing attention. The purpose of clustering is to make samples within the same cluster highly similar and samples in different clusters as dissimilar as possible, so as to mine potential knowledge.

In recent years, many big data clustering methods based on various distributed platforms have been proposed. However, challenges remain, such as high computation cost, long iteration time and poor clustering quality, which limit the efficiency of clustering on large-scale data sets. For example, the K-means algorithm on Spark is intuitive and easy to implement, but its final clustering results are strongly influenced by the selection of initial cluster centers, and it may converge to a local optimum. Spectral clustering can cluster sample spaces of arbitrary shape. However, its large computation overhead is a major problem, since it requires computing not only the pairwise similarities between all samples but also the eigenvectors of the Laplacian matrix. Co-clustering exploits the duality between feature clustering and sample clustering to find potential patterns, and achieves better clustering results than traditional methods, but its prohibitive computational cost on large data sets severely limits its scope of application.

In this paper, considering the problems of existing work, efficient big data clustering algorithms based on Spark are presented. The main contents and contributions of this paper are as follows:

(1) We design an efficient and reliable data preprocessing implementation for big data clustering analysis. Compared with existing methods, the proposed method has better performance and scalability and further improves the clustering quality.

(2) We propose a method named density-aware and auto select K to optimize the initial center points, which adaptively initializes the number and position of cluster centers according to the data distribution. An optimized strategy is also used in the distance computation.

(3) To reduce the computation overhead of large-scale similarity measurement in the spectral clustering algorithm, we propose a novel parallel similarity computation approach named the multi-round iteration technique, which avoids duplicated calculation. In addition, an efficient parallel eigenvector algorithm based on ScaLAPACK is implemented, which shortens the time needed to solve the eigenvalue problem and shows good performance.

(4) We propose a novel co-clustering algorithm based on non-negative matrix factorization (NMF) and its parallel implementation, which requires fewer iterations, obtains promising clustering results at convergence, and shows good performance and scalability.

(5) Based on the widely used distributed data-parallel computing platform Spark, we design and implement the fast K-means, parallel spectral clustering and co-clustering algorithms, which ensure promising clustering results and achieve good data and system scalability in experiments on various types of data sets. The fast K-means algorithm won first prize in the third "National Contest on Application and Innovation of Cloud Computing", hosted by the Science and Technology Development Center of the Ministry of Education, in April 2017.
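
The sensitivity of K-means to its initial cluster centers, noted in the abstract, can be seen on a toy example. The sketch below is not the thesis's fast K-means or its density-aware initialization. It is plain single-machine Lloyd's K-means in standard-library Python, and the function name `lloyd_kmeans` and the four sample points are assumptions chosen for illustration. Two different starting configurations on the same data converge to different solutions, one of them a poor local optimum.

```python
from math import dist  # available in Python 3.8+


def lloyd_kmeans(points, centers, max_iter=100):
    """Plain Lloyd's K-means (illustrative helper, not the thesis's algorithm).

    Returns the final centers and the sum of squared errors (SSE).
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    sse = sum(min(dist(p, c) ** 2 for c in centers) for p in points)
    return centers, sse


# Four corners of a 4 x 1 rectangle; the natural split is left pair vs right pair.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Initial centers near the two natural groups.
good_centers, good_sse = lloyd_kmeans(points, [(0.0, 0.0), (4.0, 0.0)])

# Initial centers stacked in the middle: the algorithm stops immediately
# with a bottom-row / top-row split.
bad_centers, bad_sse = lloyd_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_sse, bad_sse)  # prints "1.0 16.0"
```

Both runs satisfy the convergence criterion, yet the second has sixteen times the squared error of the first, which is why initialization strategies matter for K-means at any scale.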

Keywords/Search Tags:

Big Data, Clustering, Similarity Measurement, Parallelization

Related items

1	Research Of Clustering Algorithms For Mixed Data Based On Attribute Weighting And Similarity Measuring
2	An Improved Fast Clustering Algorithm And The Related Parallelization Research
3	Research On Parallelization Of Data Stream Clustering Algorithm For Police Data
4	The Research On The Method Of QAR Data Organization Based On Data Warehouse And The Similarity Measurement Of Clustering Pattern
5	Research On Similarity Measurement Method For Mobile Traffic Data
6	Research On Clustering Algorithm Of Data Stream
7	A New Nearest Neighbor Measurement Method And Its Application In Clustering Algorithm
8	Research On Clustering Algorithm Based On Tabu Search Algorithm And Similarity Measurement
9	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark
10	Research On Directional Clustering And Its Applications