Research And Application Of K Nearest Neighbor Classification Algorithm Based On Spark

Posted on: 2019-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Yan
Full Text: PDF
GTID: 2428330548486995
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, a large amount of information has been produced, and how to extract valuable information from it is a meaningful research topic. As data volumes grew beyond what a single machine could process, Hadoop emerged. However, code for Hadoop's computing model is relatively complex to write, and its computation is disk-based, which makes it slow. Spark makes up for these shortcomings, and more and more people are choosing it as a computing framework for big data. Classification is an important part of data mining and is mainly used for prediction and recommendation. Spark MLlib is Spark's machine learning library, but because Spark is relatively new, the library is not yet complete. In particular, MLlib does not support the K-nearest neighbor algorithm, even though the algorithm is simple, effective, easy to implement, and widely used. It is therefore necessary to implement the K-nearest neighbor algorithm on the Spark platform.

This paper combines a clustering algorithm with the K-nearest neighbor algorithm. The clustering algorithm first finds the center of each class in the training sample set, and the distance from each training sample to the center of its own class is then computed. The reciprocal of the square of each distance is used as that sample's weight, and the weights are used to distinguish among the K nearest neighbors of a test sample. Finally, a weighted voting strategy over the K nearest neighbors classifies the test sample. Experiments show that the improved K-nearest neighbor algorithm achieves better accuracy. The improved algorithm is then parallelized on the Spark platform, and a Spark cluster was set up for experimental analysis. The experiments show that the algorithm runs significantly faster on the Spark platform than on a single machine, so its efficiency is significantly improved.

This paper also analyzes the data skew that arises when the K-nearest neighbor algorithm is parallelized on the Spark platform. Data skew strongly affects execution efficiency: the larger the amount of data the K-nearest neighbor algorithm processes, the lower its execution efficiency. This paper improves and optimizes the parallelization of the K-nearest neighbor algorithm and proposes five solutions for data skew in different scenarios, which solve the data skew problem. Experiments first verify the effectiveness of the data skew solutions and then measure the speedup ratio and running time of the algorithm. The efficiency of the optimized algorithm is found to be significantly improved.
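The weighting scheme described in the abstract can be sketched in plain Python on a single machine. This is a minimal illustration under stated assumptions, not the thesis's implementation: the class center is taken to be the mean of each class's samples (a stand-in for the unspecified clustering step), distances are Euclidean, and a small `eps` is added to avoid division by zero. All function names here are hypothetical, and the Spark parallelization is not shown.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def class_centers(samples, labels):
    """Mean vector of each class (assumed stand-in for the clustering step)."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def sample_weights(samples, labels, eps=1e-9):
    """Weight of each training sample: 1 / (distance to its class center)^2."""
    centers = class_centers(samples, labels)
    return [
        1.0 / (squared_distance(x, centers[y]) + eps)  # eps is an assumption
        for x, y in zip(samples, labels)
    ]


def weighted_knn_predict(samples, labels, weights, query, k):
    """Find the k nearest training samples, then sum their weights per class."""
    order = sorted(
        range(len(samples)),
        key=lambda i: squared_distance(samples[i], query),
    )
    votes = defaultdict(float)
    for i in order[:k]:
        votes[labels[i]] += weights[i]
    return max(votes, key=votes.get)


# Toy data: two well-separated classes in two dimensions.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]
train_weights = sample_weights(train, train_labels)
print(weighted_knn_predict(train, train_labels, train_weights, (0.5, 0.5), k=3))  # prints "a"
```

Because the weights depend only on the training set, they can be computed once and reused for every test sample. Training samples close to their class center carry more voting power than outlying ones.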
Keywords/Search Tags: Spark, K nearest neighbors, Weight, Parallelization, Data skew