Research Of The Clustering Algorithm Based On The Spark

Posted on: 2019-12-11  Degree: Master  Type: Thesis
Country: China  Candidate: Y Xie  Full Text: PDF
GTID: 2428330590465589  Subject: Information and Communication Engineering
Abstract/Summary:
With the arrival of the big data era, massive amounts of data have appeared in all walks of life. The ability to extract useful information from such data quickly and effectively has become a measure of a company's competitiveness. Cluster analysis plays an important role in data mining, but traditional clustering methods cannot cluster massive data effectively. The Spark platform, which emerged in 2009, has attracted widespread attention: because it performs iterative computation in memory, it offers a speed advantage that other platforms cannot match. The main work of this thesis is as follows. Existing graph-based subspace clustering algorithms are limited in two ways: they handle errors (noise) of unknown structure poorly, and they require solving complex convex problems. To address these limitations, this thesis builds on the existing L2-graph-based subspace clustering algorithm and the theory of spatial projection. The original data are projected and encoded to eliminate noise, and on this basis a new method for constructing a sparse similarity graph is proposed. A subspace clustering algorithm is then developed on this graph. To adapt to big data scenarios, the algorithm is implemented as a distributed parallel clustering algorithm in Scala on Spark RDDs, calling the relevant modules in MLlib. The experimental results show that, under Gaussian noise, the proposed algorithm is more accurate and more robust than currently popular subspace clustering algorithms such as LSR1, LSR2, SSC, and LRR. In the stand-alone case, its accuracy is at least 1.71% higher than that of LRR.
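The abstract does not give the details of the projection coding step or of the Spark implementation, so the following is only a minimal sketch of the general family of techniques it refers to: coding each point as a ridge-regularized (l2) combination of the other points, keeping the k largest coefficients, and symmetrizing the result into a sparse similarity graph. It is written in plain single-machine Python rather than Scala on Spark, and the function names, the regularization weight `lam`, and the sparsity level `k` are assumptions made for illustration; it should not be read as the thesis's algorithm.

```python
# Illustrative sketch only: l2-regularized self-representation followed by
# top-k sparsification. All names and parameters here are hypothetical.

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def sparse_affinity(points, lam=0.1, k=2):
    """Return a symmetric sparse similarity matrix for the given points."""
    n = len(points)
    coef = [[0.0] * n for _ in range(n)]
    for i, x in enumerate(points):
        others = [j for j in range(n) if j != i]
        # Ridge normal equations over the other points: (D^T D + lam I) c = D^T x
        gram = [[dot(points[p], points[q]) + (lam if p == q else 0.0)
                 for q in others] for p in others]
        rhs = [dot(points[p], x) for p in others]
        c = solve(gram, rhs)
        # Sparsify: keep only the k coefficients of largest magnitude.
        keep = sorted(range(len(c)), key=lambda t: abs(c[t]), reverse=True)[:k]
        for t in keep:
            coef[i][others[t]] = c[t]
    # Symmetrize so the matrix can serve as an undirected graph's weights.
    return [[abs(coef[i][j]) + abs(coef[j][i]) for j in range(n)]
            for i in range(n)]

# Toy data: three points on the x-axis and three on the y-axis,
# i.e. two orthogonal one-dimensional subspaces.
points = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0],
          [0.0, 1.0, 0.0], [0.0, 2.0, 0.0], [0.0, 3.0, 0.0]]
W = sparse_affinity(points)
```

On this toy data the two groups lie in orthogonal subspaces, so the similarity matrix `W` has nonzero weights only between points of the same group. A spectral clustering step on `W` would then recover the two subspaces; in a distributed setting the per-point coding loop is the natural unit to parallelize, since each point's coefficients are computed independently.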
Keywords/Search Tags:clustering, MinMax K-Means clustering, Spark, subspace clustering