
Research on a Highly Efficient K-Nearest Neighbor Algorithm and Its MPI-Based Parallelization

Posted on: 2020-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Ding
Full Text: PDF
GTID: 2428330590471699
Subject: Computer Science and Technology
Abstract/Summary:
The basic k-nearest neighbor algorithm performs classification or regression by searching the dataset for the k samples most similar to a query. It has no explicit learning process and offers simplicity, offline learning, and native support for multi-class problems. Because it must compute the distance between every sample and all other samples, it is also known as the full search algorithm (FSA); obtaining the optimal value of k requires O(N²) time, so its computational cost is relatively high.

At present there are two main families of acceleration algorithms. One is exact neighbor search combined with a tree index: although the tree index reduces the number of distance computations, its time complexity remains close to that of FSA on high-dimensional data. The other is approximate neighbor search, which may deviate slightly from the exact neighbors but has a clear advantage in time complexity, and has therefore been widely studied.

Among approximate k-nearest neighbor algorithms, the priority search k-means tree algorithm is an efficient approximate neighbor search method for high-dimensional data, which uses k-means clustering to construct a tree model. Because k-means clustering has low time complexity and the tree is searched with a dedicated priority strategy, the algorithm is highly efficient. However, attribute noise that may be present in high-dimensional data directly affects the construction of the k-means tree.

To address this problem, this paper builds on the priority search k-means tree, introduces a k-means forest through bootstrap aggregating (bagging), and proposes the priority search k-means forest algorithm. Compared with the original algorithm, it not only incurs less time overhead but is also more robust to attribute noise in high-dimensional data. In addition, the paper parallelizes the algorithm with MPI to further improve its efficiency when processing data.

Finally, to remove the exact neighbor algorithm's dependence on a tree structure, this paper proposes a neighbor search algorithm based on a ball model. It replaces the original dataset with balls generated by k-means clustering rather than relying on a tree structure. This not only improves classification efficiency but also reduces the interference of label noise on the classification result, so that better results can be obtained on some datasets.
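For illustration, the following is a minimal NumPy sketch of the full search algorithm (FSA) described above: every training sample is compared against the query and the k closest samples vote on the label. The function and variable names are illustrative and do not come from the thesis.

```python
import numpy as np

def fsa_knn_classify(X_train, y_train, query, k):
    """Full search algorithm (FSA): compare the query against every
    training sample, then take a majority vote among the k closest."""
    # Euclidean distance from the query to all N training points: O(N*d)
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```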
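The bagging idea behind the proposed k-means forest can be sketched as follows. This is a deliberately simplified stand-in, not the thesis's algorithm: each "tree" is reduced to a single-level k-means partition fitted on a bootstrap resample, and the candidate sets returned by all partitions are merged before exact rescoring. All names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_kmeans_forest(X, n_trees=5, branching=8, seed=None):
    """Fit one k-means partition per bootstrap resample (bagging)."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        km = KMeans(n_clusters=branching, n_init=10).fit(X[idx])
        forest.append((km, idx))
    return forest

def forest_candidates(forest, query):
    """Union of the points falling in the cluster nearest to the query
    in each bagged partition; these candidates are then rescored exactly."""
    cand = set()
    for km, idx in forest:
        c = km.predict(query[None, :])[0]
        cand.update(idx[km.labels_ == c])
    return np.fromiter(cand, dtype=int)
```

Because each partition is fitted on a different resample, attribute noise that distorts one partition is less likely to distort all of them, which is the intuition behind the forest's robustness.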
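The MPI parallelization can likewise be sketched in a data-parallel form: the training data is split across ranks, each rank runs a local neighbor search on its shard, and the root merges the per-rank candidates. The sketch below uses mpi4py and toy data; it is an assumption about one possible layout, not the thesis's implementation.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

k = 5
if rank == 0:
    X = np.random.rand(10000, 64)        # toy high-dimensional data
    shards = np.array_split(X, size)     # one shard per rank
    query = np.random.rand(64)
else:
    shards, query = None, None

local_X = comm.scatter(shards, root=0)
query = comm.bcast(query, root=0)

# Each rank searches only its own shard (local full search).
dists = np.linalg.norm(local_X - query, axis=1)
local_k = np.sort(dists)[:min(k, len(dists))]

# The root merges the per-rank candidates into the global k nearest.
all_k = comm.gather(local_k, root=0)
if rank == 0:
    global_k = np.sort(np.concatenate(all_k))[:k]
    print("k smallest distances:", global_k)
```

Run with, for example, `mpiexec -n 4 python knn_mpi.py` (the script name is illustrative).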
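Finally, the ball-model idea, replacing the raw dataset with k-means-generated balls and classifying against the ball summaries instead of the original points, can be sketched as below. The ball here is reduced to a center plus a majority label; names and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_balls(X, y, n_balls=50):
    """Summarize the training set as balls: a k-means center plus the
    majority label of the points assigned to it."""
    km = KMeans(n_clusters=n_balls, n_init=10).fit(X)
    ball_labels = []
    for c in range(n_balls):
        members = y[km.labels_ == c]
        vals, counts = np.unique(members, return_counts=True)
        ball_labels.append(vals[np.argmax(counts)])
    return km.cluster_centers_, np.array(ball_labels)

def ball_classify(centers, ball_labels, query, k=3):
    """Vote among the k nearest ball centers instead of raw samples."""
    d = np.linalg.norm(centers - query, axis=1)
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(ball_labels[nearest], return_counts=True)
    return vals[np.argmax(counts)]
```

Because each ball's label is a majority vote over its members, a few mislabeled points inside a ball are outvoted, which illustrates why the ball model can dampen label noise.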
Keywords/Search Tags:k-nearest neighbor algorithm, high-dimensional data, k-means, noise, MPI