Research And Implementation Of Classification Algorithm Parallelization Based On Spark

Posted on:2018-04-06

Degree:Master

Type:Thesis

Country:China

Candidate:J Shi

Full Text:PDF

GTID:2348330512488150

Subject:Communication and Information System

Abstract/Summary:

The Internet is now deeply integrated into people's lives, and people's daily behavior continuously produces useful data. It is increasingly important to process this data quickly and extract useful knowledge from it. Data mining has entered the era of cloud computing, which has gradually replaced traditional stand-alone computing. With data mining methods built on a cloud computing platform, a system can extract useful information from massive amounts of data accurately and quickly. Classification algorithms in data mining are an effective tool for trend prediction and personalized recommendation. This thesis studies classification algorithms in data mining on Spark, one of the most popular open source cloud computing platforms, and selects the classical K-nearest neighbor algorithm to improve, parallelize, and implement on Spark. A Spark platform was then built and the corresponding experiments were conducted.

The improved K-nearest neighbor algorithm addresses a weakness of the traditional algorithm, which does no processing at all in the training phase. By analyzing the training data set in the training stage, I extract some characteristic values from it. Preprocessing the training set in this way reduces the number of training samples that must be compared in the classification stage, which speeds up the algorithm.

This thesis also addresses a shortcoming of the Spark platform, which partitions data using a default value. With the default partitioning, the number of data partitions does not match the number of compute nodes, which reduces performance. The program was therefore optimized so that the number of partitions is set according to the number of compute nodes, improving the utilization of computing resources and speeding up the algorithm.

This thesis chooses a data set from the UCI machine learning repository and extends it to meet the required data volume. It compares the efficiency and accuracy of the K-nearest neighbor algorithm and the improved K-nearest neighbor algorithm in the stand-alone condition, and then tests the speedup (acceleration ratio) of the improved K-nearest neighbor algorithm on Spark. The experimental results show that the improved K-nearest neighbor algorithm has the same accuracy as the ordinary K-nearest neighbor algorithm, while its efficiency is greatly improved.
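The abstract does not give the thesis's code, so the following is only a minimal sketch of one common way to parallelize K-nearest neighbor classification over a partitioned training set. It is written in plain Python as a stand-in for Spark: each partition computes its own k nearest candidates (the "map" step), and the candidates are merged into a global top k before voting (the "reduce" step). The function names, the round-robin split, the Euclidean distance, and the sample data are all assumptions made for illustration. The `num_partitions` parameter plays the role of the partition count that the thesis tunes to the number of compute nodes.

```python
import heapq
import math
from collections import Counter


def split_into_partitions(samples, num_partitions):
    """Split samples into num_partitions roughly equal chunks (round-robin).

    Hypothetical helper: stands in for how a cluster framework would
    distribute the training set across compute nodes.
    """
    return [samples[i::num_partitions] for i in range(num_partitions)]


def local_top_k(partition, query, k):
    """Return the k nearest (distance, label) pairs within one partition."""
    return heapq.nsmallest(
        k,
        ((math.dist(features, query), label) for features, label in partition),
    )


def knn_classify(samples, query, k, num_partitions):
    """Classify query by majority vote among its k nearest training samples."""
    partitions = split_into_partitions(samples, num_partitions)

    # "Map" step: each partition independently finds its k nearest candidates.
    candidates = []
    for partition in partitions:
        candidates.extend(local_top_k(partition, query, k))

    # "Reduce" step: merge the per-partition candidates into a global top k.
    nearest = heapq.nsmallest(k, candidates)

    # Majority vote over the labels of the k nearest samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: (feature vector, class label).
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
    ((4.8, 5.1), "b"),
]

print(knn_classify(train, (1.1, 1.0), k=3, num_partitions=2))
```

Because every partition returns its own k best candidates, the merged result matches a single-machine search regardless of how the data is split. The partition count therefore affects only how the work is distributed, not the predicted label.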

Keywords/Search Tags:

Data mining, Classification, K nearest neighbor, Parallelization, Spark

Related items

1	Research And Application Of K Nearest Neighbor Classification Algorithm Based On Spark
2	Research On Improved K Nearest Neighbor Algorithm Based On Spark Cloud Computing Platform
3	Outlier Mining And Parallelization Based On Reverse K-Nearest Neighbor Count And Weight Pruning
4	Study On Generalized Nearest Neighbor Pattern Classification
5	Mining Research, Based On The Integration Algorithm Of The K-nearest Neighbor Classification
6	Outlier Detection Algorithm And Its Parallelization Based On Weighted K-Nearest Neighbor
7	Research On Several Pattern Classification Methods Based On K-nearest Neighbor Criterion
8	The Research On Classification And Regression Tree's Parallelization Based On Spark Platform
9	Research On The High-Efficient K-Nearest Neighbor Algorithm And Its Parallelization Of MPI
10	Classification Of Uncertain Data Based On Nearest Neighbor