
Research on a Highly Efficient K-Nearest Neighbor Algorithm and Its MPI-Based Parallelization

Posted on: 2020-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Ding
Full Text: PDF
GTID: 2428330590471699
Subject: Computer Science and Technology
Abstract/Summary:
The basic k-nearest neighbor algorithm performs classification or regression by searching the dataset for the k samples most similar to a query. It has no explicit learning process and offers simplicity, offline learning, and native support for multi-class problems. Because it must compute the distance between every sample and all other samples, it is also known as the full search algorithm (FSA); obtaining the optimal value of k requires O(N²) time, so its computational cost is relatively high.

At present there are two main families of acceleration algorithms. One is exact neighbor search combined with a tree index: although the tree index reduces the number of distance computations, its time complexity remains close to that of FSA on high-dimensional data. The other is approximate neighbor search, which may deviate slightly from the exact neighbors but has a clear advantage in time complexity, and has therefore been widely studied.

Among approximate k-nearest neighbor algorithms, the priority search k-means tree algorithm is an efficient approximate neighbor search method for high-dimensional data, which uses k-means clustering to construct a tree model. Because k-means clustering has low time complexity and the tree is searched with a dedicated priority strategy, the algorithm is highly efficient. However, attribute noise that may be present in high-dimensional data directly affects the construction of the k-means tree.

To address this problem, this paper builds on the priority search k-means tree, introduces a k-means forest through bootstrap aggregating (bagging), and proposes the priority search k-means forest algorithm. Compared with the original algorithm, it not only incurs less time overhead but is also more robust to attribute noise in high-dimensional data. In addition, the paper parallelizes the algorithm with MPI to further improve its efficiency when processing data.

Finally, to remove the exact neighbor algorithm's dependence on a tree structure, this paper proposes a neighbor search algorithm based on a ball model. It replaces the original dataset with balls generated by k-means clustering rather than relying on a tree structure. This not only improves classification efficiency but also reduces the interference of label noise on the classification result, so that better results can be obtained on some datasets.
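For illustration, the following is a minimal NumPy sketch of the full search algorithm (FSA) described above: every training sample is compared against the query and the k closest samples vote on the label. The function and variable names are illustrative and do not come from the thesis.

```python
import numpy as np

def fsa_knn_classify(X_train, y_train, query, k):
    """Full search algorithm (FSA): compare the query against every
    training sample, then take a majority vote among the k closest."""
    # Euclidean distance from the query to all N training points: O(N*d)
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```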
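The bagging idea behind the proposed k-means forest can be sketched as follows. This is a deliberately simplified stand-in, not the thesis's algorithm: each "tree" is reduced to a single-level k-means partition fitted on a bootstrap resample, and the candidate sets returned by all partitions are merged before exact rescoring. All names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_kmeans_forest(X, n_trees=5, branching=8, seed=None):
    """Fit one k-means partition per bootstrap resample (bagging)."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        km = KMeans(n_clusters=branching, n_init=10).fit(X[idx])
        forest.append((km, idx))
    return forest

def forest_candidates(forest, query):
    """Union of the points falling in the cluster nearest to the query
    in each bagged partition; these candidates are then rescored exactly."""
    cand = set()
    for km, idx in forest:
        c = km.predict(query[None, :])[0]
        cand.update(idx[km.labels_ == c])
    return np.fromiter(cand, dtype=int)
```

Because each partition is fitted on a different resample, attribute noise that distorts one partition is less likely to distort all of them, which is the intuition behind the forest's robustness.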
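The MPI parallelization can likewise be sketched in a data-parallel form: the training data is split across ranks, each rank runs a local neighbor search on its shard, and the root merges the per-rank candidates. The sketch below uses mpi4py and toy data; it is an assumption about one possible layout, not the thesis's implementation.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

k = 5
if rank == 0:
    X = np.random.rand(10000, 64)        # toy high-dimensional data
    shards = np.array_split(X, size)     # one shard per rank
    query = np.random.rand(64)
else:
    shards, query = None, None

local_X = comm.scatter(shards, root=0)
query = comm.bcast(query, root=0)

# Each rank searches only its own shard (local full search).
dists = np.linalg.norm(local_X - query, axis=1)
local_k = np.sort(dists)[:min(k, len(dists))]

# The root merges the per-rank candidates into the global k nearest.
all_k = comm.gather(local_k, root=0)
if rank == 0:
    global_k = np.sort(np.concatenate(all_k))[:k]
    print("k smallest distances:", global_k)
```

Run with, for example, `mpiexec -n 4 python knn_mpi.py` (the script name is illustrative).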
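Finally, the ball-model idea, replacing the raw dataset with k-means-generated balls and classifying against the ball summaries instead of the original points, can be sketched as below. The ball here is reduced to a center plus a majority label; names and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_balls(X, y, n_balls=50):
    """Summarize the training set as balls: a k-means center plus the
    majority label of the points assigned to it."""
    km = KMeans(n_clusters=n_balls, n_init=10).fit(X)
    ball_labels = []
    for c in range(n_balls):
        members = y[km.labels_ == c]
        vals, counts = np.unique(members, return_counts=True)
        ball_labels.append(vals[np.argmax(counts)])
    return km.cluster_centers_, np.array(ball_labels)

def ball_classify(centers, ball_labels, query, k=3):
    """Vote among the k nearest ball centers instead of raw samples."""
    d = np.linalg.norm(centers - query, axis=1)
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(ball_labels[nearest], return_counts=True)
    return vals[np.argmax(counts)]
```

Because each ball's label is a majority vote over its members, a few mislabeled points inside a ball are outvoted, which illustrates why the ball model can dampen label noise.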
Keywords/Search Tags:k-nearest neighbor algorithm, high-dimensional data, k-means, noise, MPI