Research And Implementation Of Classification Algorithm Parallelization Based On Spark

Posted on: 2018-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Shi
Full Text: PDF
GTID: 2348330512488150
Subject: Communication and Information System
Abstract/Summary:
The Internet has become deeply integrated into people's lives, and people's daily behavior continuously produces useful data. It is increasingly important to process this data quickly and extract useful knowledge from it. Data mining has entered the era of cloud computing, which has gradually replaced traditional stand-alone computing. By applying data mining methods on a cloud computing platform, a system can extract useful information from massive amounts of data accurately and quickly. Classification algorithms in data mining are an effective tool for trend prediction and personalized recommendation.

This thesis studies classification algorithms in data mining on Spark, the most popular open-source cloud computing platform, and selects the classical K-nearest neighbor algorithm to improve, parallelize, and implement on Spark. A Spark platform was then built and the corresponding experiments were conducted. The improved K-nearest neighbor algorithm addresses a weakness of the traditional algorithm, which performs no processing at all in the training phase. By analyzing the training data set in the training stage, I extract some characteristic values. Preprocessing the training set in this way reduces the number of training samples that need to be compared in the classification stage, which speeds up the algorithm. This thesis also addresses a shortcoming of the Spark platform, which partitions the data using a default value. With the default partitioning, the number of data partitions does not match the number of compute nodes, which reduces performance. The program is therefore optimized so that the number of partitions is matched to the number of compute nodes, improving the utilization of computing resources and speeding up the algorithm.

This thesis chooses a UCI machine learning data set and extends it to meet the required data volume. It compares the efficiency and accuracy of the K-nearest neighbor algorithm and the improved K-nearest neighbor algorithm in the stand-alone setting, and then tests the speedup ratio of the improved K-nearest neighbor algorithm on Spark. Analysis of the experimental results shows that the improved K-nearest neighbor algorithm has the same accuracy as the ordinary K-nearest neighbor algorithm, while its efficiency is greatly improved.
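
The abstract does not state which values are computed in the training stage or how they are used to skip comparisons. As a minimal illustration of the general idea only, the following Python sketch assumes one possible scheme: the training phase stores each sample's distance to the centroid of the training set, and the classification phase uses the triangle inequality to skip samples that cannot be among the k nearest. The names `preprocess`, `knn_pruned`, and `knn_brute` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def preprocess(train):
    """Training phase (assumed scheme): compute the centroid of the training
    set and the distance from every training sample to that centroid."""
    dim = len(train[0][0])
    centroid = [sum(p[i] for p, _ in train) / len(train) for i in range(dim)]
    dists = [euclidean(p, centroid) for p, _ in train]
    return centroid, dists


def knn_pruned(query, train, centroid, dists, k):
    """Classification phase: skip any sample whose triangle-inequality lower
    bound |d(q, c) - d(x, c)| already reaches the current k-th best distance.
    Returns the predicted label and the number of full comparisons made."""
    dq = euclidean(query, centroid)
    best = []  # (distance, label) pairs, sorted ascending, at most k entries
    compared = 0
    for (point, label), dc in zip(train, dists):
        if len(best) == k and abs(dq - dc) >= best[-1][0]:
            continue  # cannot be closer than the current k-th neighbor
        compared += 1
        d = euclidean(query, point)
        if len(best) < k or d < best[-1][0]:
            best.append((d, label))
            best.sort(key=lambda t: t[0])
            del best[k:]
    votes = Counter(label for _, label in best)
    return votes.most_common(1)[0][0], compared


def knn_brute(query, train, k):
    """Reference K-nearest neighbor: compares the query with every sample."""
    ranked = sorted(((euclidean(query, p), label) for p, label in train),
                    key=lambda t: t[0])
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because a sample is skipped only when its lower bound shows it cannot be closer than the current k-th neighbor, this sketch selects the same neighbors as the brute-force version, which is consistent with the abstract's report of unchanged accuracy. How many comparisons it saves depends on the data distribution.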
Keywords/Search Tags: Data mining, Classification, K-nearest neighbor, Parallelization, Spark