
Research And Application Of Clustering Algorithm Based On Spark Platform

Posted on: 2021-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ma
Full Text: PDF
GTID: 2428330614460383
Subject: Software engineering
Abstract/Summary:
With the rapid development of the modern information society, Internet data is growing exponentially, and this explosion in data scale has ushered in the era of big data. Research institutions, groups and companies are increasingly focused on how to extract useful value from large amounts of data, and machine learning algorithms offer solutions to this kind of requirement. Clustering, an important branch of machine learning, has been widely used in many fields. However, traditional clustering algorithms usually run on a single machine. When faced with big data, such algorithms perform poorly and run inefficiently, so it is difficult for them to meet performance requirements. A common way to improve the big data processing ability of a traditional clustering algorithm is to parallelize it on a distributed computing platform, so that the algorithm benefits from the resources of a distributed cluster. The mainstream distributed computing platforms are Apache Hadoop and Apache Spark. Spark keeps intermediate results in memory, whereas Hadoop writes them to disk, so Spark is faster and better suited to machine learning algorithms based on iterative computation. This thesis improves and optimizes several problems in existing clustering algorithms and parallelizes the optimized algorithms on the Spark distributed platform. The main work is as follows:
(1) A variety of clustering algorithms are studied, and several representative algorithms are selected for further study. The principles and operating characteristics of these algorithms are examined, and their advantages and disadvantages are analyzed. Several defects are then addressed to improve efficiency: sensitivity to the initial cluster centers, the tendency to fall into local extrema, the difficulty of determining the number of clusters, and the complexity of distance calculation.
(2) To improve the big data processing capability of the algorithms, they are parallelized on the Spark distributed computing platform, using the RDD-based matrix and vector computing methods provided by the Spark MLlib machine learning library. Several Spark facilities are used to improve execution efficiency, including cache/persist data persistence, accumulators and broadcast variables.
(3) A Spark big data experiment platform is built, with the Spark cluster running in "Spark on YARN" mode. On this platform, the improved algorithms are applied to real data sets to evaluate the effect of the improvements and their application value. The experimental data consists of two parts: one part is taken from the UCI data set platform and mainly includes the KDD Cup simulated network attack data set, the Iris data set and the Wine data set; the other part is a city population data set collected by a company. Experimental results show that the improved algorithms achieve better clustering accuracy, performance and practicability.
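The abstract does not name the algorithms that were parallelized, so the following is only a minimal sketch of the general pattern, assuming a k-means-style algorithm. Plain Python lists stand in for RDD partitions, and every function name here (`assign_partition`, `merge`, `kmeans_step`, `kmeans`) is hypothetical rather than taken from the thesis or from Spark. The idea shown is that the current centers are shared with every partition (the role a broadcast variable would play), each partition computes partial sums locally, and only those small partial results are merged to update the centers.

```python
from math import dist


def assign_partition(points, centers):
    """Map step for one partition: {center_index: (vector_sum, count)}."""
    partial = {}
    for p in points:
        # Nearest center by Euclidean distance.
        idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        s, n = partial.get(idx, ([0.0] * len(p), 0))
        partial[idx] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def merge(a, b):
    """Reduce step: combine the partial sums of two partitions."""
    out = dict(a)
    for idx, (s, n) in b.items():
        if idx in out:
            s0, n0 = out[idx]
            out[idx] = ([x + y for x, y in zip(s0, s)], n0 + n)
        else:
            out[idx] = (s, n)
    return out


def kmeans_step(partitions, centers):
    """One iteration: share centers with all partitions, merge the partials."""
    total = {}
    for part in partitions:
        total = merge(total, assign_partition(part, centers))
    new_centers = []
    for i, c in enumerate(centers):
        if i in total:
            s, n = total[i]
            new_centers.append([x / n for x in s])
        else:
            new_centers.append(list(c))  # empty cluster: keep the old center
    return new_centers


def kmeans(partitions, centers, max_iter=20, tol=1e-9):
    """Repeat the step until no center moves by more than tol."""
    for _ in range(max_iter):
        updated = kmeans_step(partitions, centers)
        shift = max(dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift <= tol:
            break
    return centers


# Two "partitions" of 2-D points and two initial centers (illustrative data).
partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 1.0), (10.0, 11.0)]]
result = kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)])
```

Because the same partitions are scanned on every iteration, this is the access pattern for which keeping the data in memory (cache/persist in Spark terms) pays off.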
Keywords/Search Tags:big data, Spark platform, clustering algorithm, optimization improvement, parallel computing