Research And Application Of Clustering Algorithm Based On Spark Platform

Posted on:2021-02-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y Ma

Full Text:PDF

GTID:2428330614460383

Subject:Software engineering

Abstract/Summary:

With the rapid development of the modern information society, Internet data is growing exponentially, and this explosion in data scale has ushered in the era of big data. Research institutions, groups, and companies are increasingly focused on how to extract useful value from large amounts of data, and various machine learning algorithms provide solutions to this kind of requirement. As an important branch of machine learning, clustering algorithms have been widely used in many fields. However, traditional clustering algorithms usually run on a single machine. When faced with big data, such algorithms often perform poorly and inefficiently, and it is difficult for them to meet performance requirements. To improve the big data processing ability of traditional clustering algorithms, the common approach is to parallelize them on a distributed computing platform, so that the advantages of a distributed cluster can be used to improve performance. The mainstream distributed computing platforms are Apache Hadoop and Apache Spark. Spark, which stores intermediate results in memory, is faster than Hadoop, which stores them on disk, so Spark is better suited to machine learning algorithms based on iterative computation. In this paper, some problems in existing clustering algorithms are addressed, and the optimized algorithms are parallelized on the Spark distributed platform. The main work is as follows:

(1) A variety of clustering algorithms are studied, and several representative algorithms are selected for further study. The principles and operating characteristics of the selected algorithms are examined, and their advantages and disadvantages are analyzed. Some of their defects are then addressed to improve efficiency. These problems include sensitivity to the initial cluster centers, the tendency to fall into local optima, the difficulty of determining the number of clusters, and the complexity of distance calculation.

(2) To improve the big data processing capability of the algorithms, they are parallelized on the Spark distributed computing platform, in combination with the RDD-based matrix and vector computing methods provided by the Spark MLlib machine learning library. Several mechanisms provided by Spark are used to improve execution efficiency, including data persistence with cache/persist, accumulators, and broadcast variables.

(3) A Spark big data experiment platform is built, with the Spark cluster running in "Spark on YARN" mode. On this platform, the improved algorithms are applied to several real data sets to observe their effect and application value. The experimental data consists of two parts: one part is taken from the UCI data set platform, mainly including the KDDcup simulated network attack data set, the Iris data set, and the Wine data set; the other part is a city population data set collected by a company. Experimental results show that the improved algorithms achieve better clustering accuracy, performance, and practicality.
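To make the parallelization pattern in item (2) more concrete, here is a minimal sketch in plain Python. It does not use Spark and is not the thesis's implementation. It assumes a k-means-style update (the abstract does not name the algorithm) and only imitates the structure such a job might take: the current centers are shared read-only with every partition (the role a broadcast variable plays in Spark), each partition computes partial sums locally (comparable to a per-partition map), and the partial results are merged (comparable to a reduce). All function names and the sample data are hypothetical.

```python
from functools import reduce

def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def partition_stats(partition, centers):
    """Map step: per-partition {cluster index: (coordinate sums, count)}."""
    stats = {}
    for point in partition:
        i = closest(point, centers)
        sums, count = stats.get(i, ([0.0] * len(point), 0))
        stats[i] = ([s + p for s, p in zip(sums, point)], count + 1)
    return stats

def merge_stats(a, b):
    """Reduce step: combine the partial results of two partitions."""
    merged = dict(a)
    for i, (sums, count) in b.items():
        if i in merged:
            old_sums, old_count = merged[i]
            merged[i] = ([x + y for x, y in zip(old_sums, sums)], old_count + count)
        else:
            merged[i] = (sums, count)
    return merged

def kmeans_step(partitions, centers):
    """One iteration; centers are read-only in every map call (assumed broadcast)."""
    total = reduce(merge_stats,
                   (partition_stats(p, centers) for p in partitions), {})
    new_centers = []
    for i, center in enumerate(centers):
        if i in total:
            sums, count = total[i]
            new_centers.append([s / count for s in sums])
        else:
            new_centers.append(list(center))  # keep a center that attracted no points
    return new_centers

# Hypothetical data: two "partitions" of 2-D points and two starting centers.
partitions = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]]
centers = [[1.0, 1.0], [9.0, 9.0]]
new_centers = kmeans_step(partitions, centers)
```

Only the small per-cluster sums and counts cross partition boundaries, never the raw points, which is the main reason this shape suits a distributed cluster. In an actual Spark job the partitioned data would typically also be cached, since every iteration rereads it.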

Keywords/Search Tags:

big data, Spark platform, clustering algorithm, optimization improvement, parallel computing

Related items

1	Research And Improvement Of Big Data Parallel Clustering Algorithm Based On Spark
2	Optimization And Implementation Of Clustering Algorithms Based On Spark Platform
3	Research On Parallelization Of DBSCAN Clustering Algorithm For Spatial Data Mining Based On Spark Platform
4	Research On Optimization And Parallel Of K-means Algorithm On Spark
5	Research On Improved DBSCAN Algorithm Based On Spark Platform
6	Research And Implementation Of Memory Optimization Based On Parallel Computing Engine Spark
7	Research And Realization Of Clustering Algorithm Based On Spark Platform
8	Research And Application Of Big Data Clustering Algorithm Based On Spark Platform
9	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
10	Design and Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming