With the advent of the 5G era, global data has grown explosively. Under such large datasets, the k-nearest neighbors (KNN) algorithm is a particularly expensive operation for both classification and regression predictive problems. To predict the value of a new data point, the feature similarity between every object in the test set and every object in the training set must be computed. Owing to this computational cost, a single machine cannot handle similarity calculations between large-scale datasets. A feasible improvement strategy is to substantially reduce the number of training samples involved in the computation. The main contributions of this article are as follows:

(1) Aiming at the limitations of the KNN algorithm on large-scale, high-dimensional data, we propose an adaptive vKNN algorithm, which builds on the Voronoi diagram under the MapReduce parallel framework and makes full use of the advantages of parallel computing for processing large-scale data. In the partition-selection step, we design a new predictive strategy that finds the optimal Relevant-partition for each sample point. This allows irrelevant data to be filtered out effectively, reducing the KNN join computation and improving execution efficiency.

(2) To address the contingency and instability of the random partitioning strategy used by the vKNN algorithm, we further propose a cvKNN algorithm based on stable clustering. This method adds a new parallel partitioning strategy that uses the Canopy-kmeans algorithm, seeded according to the "Minimum-Maximum principle", to obtain a stable and efficient partition of the training data, while retaining the vKNN property of dynamically matching the optimal Relevant-partition.

Experimental results show that the proposed methods achieve good execution efficiency in a data-intensive environment and offer high scalability without sacrificing accuracy. Moreover, the parallel nearest-neighbor method based on stable clustering attains higher partitioning efficiency and stability.
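The core idea of contribution (1) can be illustrated with a minimal single-machine sketch: training points are assigned to the Voronoi cell of their nearest pivot, and a query searches only its own cell (its Relevant-partition) instead of the full training set. This is a simplified, hypothetical illustration (the pivot count, data, and function names are assumptions, and the paper's actual method runs the cells as MapReduce partitions with a predictive partition-selection strategy):

```python
import numpy as np

def assign_voronoi_partitions(data, pivots):
    """Assign each training point to the Voronoi cell of its nearest pivot."""
    # distance matrix of shape (n_points, n_pivots)
    d = np.linalg.norm(data[:, None, :] - pivots[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def knn_in_relevant_partition(query, data, cell_of, pivots, k=3):
    """Search the k nearest neighbours only inside the query's nearest
    Voronoi cell, skipping all other partitions entirely."""
    best_cell = int(np.argmin(np.linalg.norm(pivots - query, axis=1)))
    candidates = data[cell_of == best_cell]
    d = np.linalg.norm(candidates - query, axis=1)
    order = np.argsort(d)[:k]
    return candidates[order], d[order]

rng = np.random.default_rng(0)
train = rng.random((1000, 2))
# pick a few training points as Voronoi pivots (random choice for the sketch)
pivots = train[rng.choice(len(train), size=8, replace=False)]
cell_of = assign_voronoi_partitions(train, pivots)
neighbors, dists = knn_in_relevant_partition(np.array([0.5, 0.5]), train, cell_of, pivots, k=3)
```

Restricting the search to one cell trades a small risk of missing a true neighbor near a cell boundary for a large reduction in distance computations, which is why the partition-selection strategy matters.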
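For contribution (2), the "Minimum-Maximum principle" used to seed Canopy-kmeans can be sketched as farthest-point selection: each new center is the point whose minimum distance to the already-chosen centers is largest, which makes the initial centers well spread and the resulting partition stable rather than random. The function name and the deterministic choice of the first center are assumptions for this sketch:

```python
import numpy as np

def minmax_centers(data, t):
    """Select t initial centers by the Minimum-Maximum principle:
    each new center maximizes its minimum distance to the centers
    chosen so far (farthest-point traversal)."""
    chosen = [0]  # start deterministically from the first point
    min_d = np.linalg.norm(data - data[0], axis=1)
    for _ in range(t - 1):
        nxt = int(np.argmax(min_d))  # point farthest from all chosen centers
        chosen.append(nxt)
        min_d = np.minimum(min_d, np.linalg.norm(data - data[nxt], axis=1))
    return data[chosen]

pts = np.random.default_rng(1).random((200, 2))
centers = minmax_centers(pts, 4)
```

Because the selection is deterministic given the data, repeated runs produce the same partition, avoiding the run-to-run variability of randomly seeded partitioning.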