Font Size: a A A

Improvement Of K-Nearest Neighbors Classification Algorithms Based On Reference Points And Its Parallelization With MPI

Posted on:2019-08-28Degree:MasterType:Thesis
Country:ChinaCandidate:C LiangFull Text:PDF
GTID:2428330590965783Subject:Computer technology
Abstract/Summary:PDF Full Text Request
The k-nearest neighbors(kNN)algorithm is a classical classification method based on statistics.It is characterized by its simplicity,high accuracy and no prior statistical knowledge.It has become one of the most widely studied and applied algorithms in the field of data mining.Based on the analysis of the existing k-nearest neighbors algorithm,this thesis mainly studies the improvement and parallelization of k-nearest neighbors algorithm.The traditional k-nearest neighbors algorithm has quadratic time complexity.In order to improve the speed of classification,one class of methods are to construct the tree index structure to speed up the k-nearest neighbors search,but the performance of the tree-based k-nearest neighbors algorithm get worse with the increase of the dimension of the data set.The other methods avoid the calculation of the exact nearest neighbors by looking up the approximate nearest neighbors.Among these methods,the k-nearest neighbors algorithm based on reference points has higher classification efficiency for all kinds of data sets,but the classification accuracy is still a big loss compared with the traditional k-nearest neighbors algorithm.To solve this problem,an improved k-nearest neighbors algorithm based on reference points is proposed in this thesis,and its time complexity is only O(nlogn).Considering the spatial distribution of the training samples,the selection of the reference points is improved according to the variance of the distance between the training samples and candidate points,and the adaptive weight is given to the reference points according to the different functions of the reference points when searching the nearest samples.Experiments on public and artificial data sets show that the proposed algorithm achieves high classification efficiency,and better classification accuracy than the existing k-nearest neighbors algorithm based on reference points.In order to solve the problems of weak computing performance,high memory consumption and poor scalability in large-scale data classification under stand-alone environment,two kinds of parallel algorithm named DPkNN and PkNN,based on MPI(Message Passing Interface)model,are proposed.The former is based on data parallelism,and distributes the data set to be categorized evenly to each process for parallel processing.The latter constructs all the processes as a pipeline model,and each process completes a part of the classification task in parallel.For k-nearest neighbors algorithm based on reference points proposed in this thesis,a method to obtain the reference points by dividing the training set is also designed.The two kinds of parallel algorithm are applied to k-nearest neighbors algorithm based on reference points and the traditional kNN algorithm,and the experiments are carried out on a super-computing platform using up to 48 cores resources.The results show that,comparing to existing parallel methods,the two parallel algorithms can significantly improve the efficiency of classification,and the PkNN algorithm has better speedup and scalability.Finally,kNN algorithm is applied to handwriting recognition,and a handwritten recognition system integrating various algorithms has been implemented in this thesis.It has a friendly interactive interface and can recognize the numeral letter of handwritten input effectively.At the same time,the effectiveness of the proposed algorithm in the third chapter is further verified by the handwritten dataset.
Keywords/Search Tags:kNN, classification, reference points, MPI, parallel algorithm
PDF Full Text Request
Related items