Improvement Of K-Nearest Neighbors Classification Algorithms Based On Reference Points And Its Parallelization With MPI

Posted on:2019-08-28

Degree:Master

Type:Thesis

Country:China

Candidate:C Liang

Full Text:PDF

GTID:2428330590965783

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

The k-nearest neighbors(kNN)algorithm is a classical classification method based on statistics.It is characterized by its simplicity,high accuracy and no prior statistical knowledge.It has become one of the most widely studied and applied algorithms in the field of data mining.Based on the analysis of the existing k-nearest neighbors algorithm,this thesis mainly studies the improvement and parallelization of k-nearest neighbors algorithm.The traditional k-nearest neighbors algorithm has quadratic time complexity.In order to improve the speed of classification,one class of methods are to construct the tree index structure to speed up the k-nearest neighbors search,but the performance of the tree-based k-nearest neighbors algorithm get worse with the increase of the dimension of the data set.The other methods avoid the calculation of the exact nearest neighbors by looking up the approximate nearest neighbors.Among these methods,the k-nearest neighbors algorithm based on reference points has higher classification efficiency for all kinds of data sets,but the classification accuracy is still a big loss compared with the traditional k-nearest neighbors algorithm.To solve this problem,an improved k-nearest neighbors algorithm based on reference points is proposed in this thesis,and its time complexity is only O(nlogn).Considering the spatial distribution of the training samples,the selection of the reference points is improved according to the variance of the distance between the training samples and candidate points,and the adaptive weight is given to the reference points according to the different functions of the reference points when searching the nearest samples.Experiments on public and artificial data sets show that the proposed algorithm achieves high classification efficiency,and better classification accuracy than the existing k-nearest neighbors algorithm based on reference points.In order to solve the problems of weak computing performance,high memory consumption and poor scalability in large-scale data classification under stand-alone environment,two kinds of parallel algorithm named DPkNN and PkNN,based on MPI(Message Passing Interface)model,are proposed.The former is based on data parallelism,and distributes the data set to be categorized evenly to each process for parallel processing.The latter constructs all the processes as a pipeline model,and each process completes a part of the classification task in parallel.For k-nearest neighbors algorithm based on reference points proposed in this thesis,a method to obtain the reference points by dividing the training set is also designed.The two kinds of parallel algorithm are applied to k-nearest neighbors algorithm based on reference points and the traditional kNN algorithm,and the experiments are carried out on a super-computing platform using up to 48 cores resources.The results show that,comparing to existing parallel methods,the two parallel algorithms can significantly improve the efficiency of classification,and the PkNN algorithm has better speedup and scalability.Finally,kNN algorithm is applied to handwriting recognition,and a handwritten recognition system integrating various algorithms has been implemented in this thesis.It has a friendly interactive interface and can recognize the numeral letter of handwritten input effectively.At the same time,the effectiveness of the proposed algorithm in the third chapter is further verified by the handwritten dataset.

Keywords/Search Tags:

kNN, classification, reference points, MPI, parallel algorithm

PDF Full Text Request

Related items

1	Research On Evolutionary Clustering Based On Reference Points
2	Research On Rogue Access Point Indoor Positioning Based On Reference Points Optimization
3	Research On Density Clustering Algorithm Based On Reference Points
4	Covering points with axis parallel lines
5	Research On Classification Based On Improved And Parallel Artificial Bee Colony Algorithm
6	The Parallel Of Bayesian Classification Algorithms For CRM
7	Research On Fast Parallel Image Matching Algorithm Based On Feature Points
8	Research On 3D Fingerprint Recognition Technology Based On The Space Feature Points
9	Research On Hyperspectral Imagery Classification Based On Parallel Support Vector Machine In Cloud Environment
10	Parallel support vector machines for multi-category classification of large scale data