Research Of The Clustering Algorithm Based On The Spark

Posted on: 2019-12-11

Degree: Master

Type: Thesis

Country: China

Candidate: Y Xie

GTID: 2428330590465589

Subject: Information and Communication Engineering

Abstract/Summary:

With the arrival of the big data era, massive amounts of data have appeared in all walks of life. The ability to obtain useful information from this data quickly and effectively is now a measure of a company's competitiveness. Cluster analysis plays an important role in data mining, but traditional clustering methods cannot cluster massive data effectively. The Spark platform, which emerged in 2009, has attracted widespread attention because it performs iterative computation in memory, which gives it a speed advantage that other platforms cannot match.

The main work of this thesis is as follows. Existing graph-based subspace clustering algorithms are limited in two ways: they handle noise (errors) of unknown type or structure poorly, and they require solving complex convex problems. To address these limitations, this thesis builds on the existing L2-graph-based subspace clustering algorithm and on the theory of spatial projection. The original data is projected and encoded to eliminate noise, and on this basis a new method for constructing a sparse similarity graph is proposed. A subspace clustering algorithm is then developed on this graph. To adapt to big data scenarios, a distributed parallel version of the clustering algorithm is implemented in Scala on Spark RDDs, calling the relevant modules in MLlib. Experimental results show that, under Gaussian noise, the proposed algorithm is more accurate and more robust than currently popular subspace clustering algorithms such as LSR1, LSR2, SSC, and LRR. In the stand-alone case, its accuracy is at least 1.71% higher than that of LRR.
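The graph-construction idea in the abstract can be illustrated with a small sketch. The Python code below is not the thesis's algorithm and not its Scala/Spark implementation. It assumes a simplified scheme in which each point is coded over all other points with an l2 (ridge) penalty, only the largest coefficients are kept, and the result is symmetrized into a similarity graph. The function names and the parameters lam and keep are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ridge_code(points: List[Vector], i: int, lam: float) -> Vector:
    """Code points[i] over all other points with an l2 penalty (assumed scheme)."""
    others = [j for j in range(len(points)) if j != i]
    # Normal equations of ridge regression: (D^T D + lam * I) c = D^T x_i.
    gram = [[dot(points[p], points[q]) + (lam if p == q else 0.0)
             for q in others] for p in others]
    rhs = [dot(points[p], points[i]) for p in others]
    coeffs = solve(gram, rhs)
    code = [0.0] * len(points)  # the coefficient on the point itself stays 0
    for j, c in zip(others, coeffs):
        code[j] = c
    return code


def similarity_graph(points: List[Vector], lam: float = 0.1, keep: int = 2) -> Matrix:
    """Build a sparse, symmetric similarity matrix from the ridge codes."""
    n = len(points)
    codes = []
    for i in range(n):
        code = ridge_code(points, i, lam)
        # Sparsify: keep only the `keep` largest-magnitude coefficients.
        top = sorted(range(n), key=lambda j: abs(code[j]), reverse=True)[:keep]
        codes.append([abs(code[j]) if j in top else 0.0 for j in range(n)])
    # Symmetrize so that the weight between i and j does not depend on direction.
    return [[codes[i][j] + codes[j][i] for j in range(n)] for i in range(n)]


# Toy data: three points in span(e1, e2) and three in span(e3, e4).
points = [
    [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, -1.0],
]
graph = similarity_graph(points, lam=0.1, keep=2)
```

A clustering step (for example, spectral clustering) would then run on this graph; that step is omitted here. In the toy data the two subspaces are orthogonal, so the Gram matrix is block diagonal and every cross-subspace weight in the graph is zero.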

Keywords/Search Tags:

clustering, MinMax K-Means clustering, Spark, subspace clustering

Related items

1	A Deep Embedding Clustering Algorithm Considering Preservation Of Initial Clustering Structure And Its Application
2	Research And Implementation Of Clustering Method For High Dimensional Categorical Data
3	Research On Sparse Subspace Clustering Models And Algorithms Based On Low-rank Representation
4	Research On Fast And Effective Subspace Clustering Methods
5	Research On Improved Sparse Subspace Clustering Algorithm
6	Optimization And Application Of K-means Clustering Algorithm Based On Spark Framework
7	Research On Density Peak-based Clustering Algorithm And Its Parallel Implementation
8	High-dimensional Data Clustering Method Based On Embedded Subspace
9	Research And Application Of Clustering Method For Big Visual Data
10	The Research And Application Of Text Clustering Based On Improved K-means Algorithm