Research And Application Of K Nearest Neighbor Classification Algorithm Based On Spark

Posted on:2019-03-08

Degree:Master

Type:Thesis

Country:China

Candidate:X Z Yan

Full Text:PDF

GTID:2428330548486995

Subject:Computer application technology

Abstract/Summary:

With the development of information technology, vast amounts of data are being produced, and extracting valuable information from them is an important research problem. As data volumes grew beyond what a single machine can process, Hadoop emerged. However, code for Hadoop's computing model is comparatively complex to write, and its disk-based computation is slow. Spark makes up for these shortcomings, and more and more people are choosing it as their big-data computing framework.

Classification is an important part of data mining and is mainly used for prediction and recommendation. Spark MLlib is Spark's machine learning library, but because Spark is relatively new, the library is not yet complete. In particular, MLlib does not support the K-nearest neighbor algorithm, even though the algorithm is simple, effective, easy to implement, and widely used. It is therefore necessary to implement the K-nearest neighbor algorithm on the Spark platform.

This thesis combines a clustering algorithm with the K-nearest neighbor algorithm. The clustering algorithm first finds the center of each class in the training set, and the distance from each training sample to the center of its class is then computed. The reciprocal of the square of each distance is used as the sample's weight, and these weights are used to differentiate among the K nearest neighbors of a test sample. Finally, a weighted voting strategy over the K nearest neighbors classifies the test sample. Experiments show that the improved K-nearest neighbor algorithm achieves better accuracy. The improved algorithm is then parallelized on the Spark platform, and a Spark cluster was set up for experimental analysis. The experiments show that the algorithm runs significantly faster on the Spark platform than on a single machine, so its efficiency is significantly improved.

The thesis also analyzes the data skew that arises when the K-nearest neighbor algorithm is parallelized on the Spark platform. Data skew strongly affects execution efficiency: the larger the data volume the algorithm processes, the lower its execution efficiency. The thesis therefore optimizes the parallelization of the K-nearest neighbor algorithm and proposes five solutions to data skew for different scenarios. Experiments first verified the effectiveness of the data skew solutions and then measured the speedup ratio and running time of the algorithm, showing that the efficiency of the optimized algorithm is significantly improved.
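
The weighting scheme described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not the thesis's implementation: the abstract does not name the clustering algorithm, so the class center is assumed here to be the per-class mean, the distance is assumed to be Euclidean, and the function names and the small constant `eps` (which avoids division by zero for a sample lying exactly on its class center) are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_centers(samples, labels):
    """Assumption: the center of each class is the mean of its samples."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for i, v in enumerate(x):
            sums[y][i] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def sample_weights(samples, labels, eps=1e-9):
    """Weight = reciprocal of the squared distance to the sample's own class center."""
    centers = class_centers(samples, labels)
    return [
        1.0 / (squared_distance(x, centers[y]) + eps)
        for x, y in zip(samples, labels)
    ]


def predict(samples, labels, weights, query, k=3):
    """Find the k nearest training samples, then let each one vote with its weight."""
    nearest = sorted(
        range(len(samples)),
        key=lambda i: squared_distance(samples[i], query),
    )[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[labels[i]] += weights[i]
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    train_y = ["a", "a", "a", "b", "b", "b"]
    w = sample_weights(train_x, train_y)
    print(predict(train_x, train_y, w, (0.5, 0.5), k=3))
```

Under these assumptions, a training sample that lies far from the center of its own class receives a small weight, so it contributes little to the vote even when it is among the K nearest neighbors of the test sample.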

Keywords/Search Tags:

Spark, K nearest neighbors, Weight, Parallelization, Data skew

Related items

1	Research And Implementation Of Classification Algorithm Parallelization Based On Spark
2	Outlier Mining And Parallelization Based On Reverse K-Nearest Neighbor Count And Weight Pruning
3	Research On Partition Loading Balance Based On Spark Data Skew
4	Research On And Application Of The Solution For Spark Data Skew Scenarios
5	Research Of Data Skew On Spark Based On Improved Partition Method
6	Research On Distributed K-Nearest Neighbors Query Method Over Moving Trajectories
7	Research On Spark Data Skewing Improvement And Decision Tree Parallelization Application Under Cloud Edge Collaboration
8	Research On Data Stream Reverse K Nearest Neighbors Outlier Mining Algorithm Based On X~* Tree
9	Research On Several Pattern Classification Methods Based On K-nearest Neighbor Criterion
10	Research And Optimization Of Adaptive Techniques For Mitigating Skew In Spark