
Research On Partition Selection Strategy For Big Data Management Based On KNN Connection Processing

Posted on: 2022-06-19  Degree: Master  Type: Thesis
Country: China  Candidate: Y Hu  Full Text: PDF
GTID: 2518306602470664  Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the 5G era, global data volumes have grown explosively. On large datasets, the k-nearest-neighbors (KNN) algorithm is a particularly expensive operation for both classification and regression problems: to predict the value of a new data point, the feature similarity between each object in the test dataset and each object in the training dataset must be calculated. Because of this computational cost, a single computer cannot handle similarity calculations between large-scale datasets. A feasible improvement strategy is to substantially reduce the number of training samples involved in the calculation. The main contributions of this thesis are as follows:

(1) To address the limitations of the KNN algorithm on large-scale, high-dimensional data, we propose an adaptive vKNN algorithm, which is based on the Voronoi diagram, runs under the MapReduce parallel framework, and makes full use of the advantages of parallel computing for large-scale data. For partition selection, we design a new predictive strategy that finds the optimal Relevant-partition for each sample point. This lets us effectively filter out irrelevant data, reduce KNN Join computation, and improve operating efficiency.

(2) To address the contingency and instability introduced by the random strategy that the vKNN algorithm uses during data partitioning, we further propose a cvKNN algorithm based on stable clustering. This method adds a new parallel partition strategy and uses the Canopy-kmeans algorithm, based on the “Minimum-Maximum principle”, to obtain an effective partition of the training data stably and efficiently, while retaining the vKNN algorithm's dynamic matching of the optimal Relevant-partition.

Experimental results show that the proposed methods execute efficiently in a data-intensive environment and scale well while preserving accuracy. In addition, the parallel nearest-neighbor method based on stable clustering achieves higher partitioning efficiency and stability.
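The partition-selection idea behind a Voronoi-based KNN join can be illustrated with a small single-machine sketch. This is not the thesis's vKNN algorithm or its predictive strategy, which the abstract does not specify. It swaps in a standard triangle-inequality bound to decide which Voronoi cells are relevant to a query. The function names (build_partitions, knn_with_pruning), the random pivot choice, and the Euclidean distance are all assumptions made for illustration, and the MapReduce layer is omitted.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_partitions(train, num_pivots, seed=0):
    """Assign each training point to its nearest pivot (one Voronoi cell per pivot).

    Pivots are drawn at random here, which is a hypothetical choice for this sketch.
    Returns the pivots, the cells, and each cell's radius (max member-to-pivot distance).
    """
    rng = random.Random(seed)
    pivots = rng.sample(train, num_pivots)
    cells = [[] for _ in pivots]
    radii = [0.0] * len(pivots)
    for p in train:
        d, i = min((dist(p, pivot), i) for i, pivot in enumerate(pivots))
        cells[i].append(p)
        radii[i] = max(radii[i], d)
    return pivots, cells, radii


def knn_with_pruning(q, k, pivots, cells, radii):
    """Return (sorted distances to the k nearest training points, cells scanned)."""
    # Visit cells from the nearest pivot outward so the bound tightens early.
    order = sorted(range(len(pivots)), key=lambda i: dist(q, pivots[i]))
    best = []  # max-heap of the k smallest distances, stored negated
    scanned = 0
    for i in order:
        # Triangle inequality: no point in cell i is closer to q than this bound.
        lower_bound = dist(q, pivots[i]) - radii[i]
        if len(best) == k and lower_bound >= -best[0]:
            continue  # cell cannot improve the current k-th distance: irrelevant
        scanned += 1
        for p in cells[i]:
            d = dist(q, p)
            if len(best) < k:
                heapq.heappush(best, -d)
            elif d < -best[0]:
                heapq.heapreplace(best, -d)
    return sorted(-d for d in best), scanned


if __name__ == "__main__":
    rng = random.Random(42)
    train = [(rng.random(), rng.random()) for _ in range(1000)]
    pivots, cells, radii = build_partitions(train, num_pivots=16)
    distances, scanned = knn_with_pruning((0.5, 0.5), 5, pivots, cells, radii)
    print(f"scanned {scanned} of {len(cells)} cells; k-th distance {distances[-1]:.4f}")
```

In a distributed setting each cell would typically be handled by a separate reducer, so skipping a cell removes both distance computations and data shuffling. How much is skipped depends on the data distribution and on how the pivots are chosen.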
Keywords/Search Tags: Large-scale data, MapReduce, Partition selection, KNN Join, Canopy-kmeans