Research On Partition Selection Strategy For Big Data Management Based On KNN Connection Processing

Posted on:2022-06-19

Degree:Master

Type:Thesis

Country:China

Candidate:Y Hu

Full Text:PDF

GTID:2518306602470664

Subject:Computer Science and Technology

Abstract/Summary:

With the advent of the 5G era, global data volumes have grown explosively. On large datasets, the k-nearest-neighbors (KNN) algorithm is a particularly expensive operation for both classification and regression problems. To predict the values of new data points, it must compute the feature similarity between each object in the test dataset and each object in the training dataset. Because of this computational cost, a single computer cannot handle similarity calculations between large-scale datasets. A feasible improvement strategy is to substantially reduce the number of training samples involved in the calculation. The main contributions of this thesis are as follows:

(1) To address the limitations of the KNN algorithm in large-scale, high-dimensional data processing, we propose an adaptive vKNN algorithm, which is based on the Voronoi diagram under the MapReduce parallel framework and makes full use of the advantages of parallel computing in processing large-scale data. For partition selection, we design a new prediction strategy that finds the optimal Relevant-partition for each sample point. This lets us screen out irrelevant data effectively, reduce KNN Join computation, and improve operating efficiency.

(2) To address the unpredictability and instability of the random strategy that the vKNN algorithm uses during data partitioning, we further propose a cvKNN algorithm based on stable clustering. This method adds a new parallel partition strategy and uses the Canopy-kmeans algorithm based on the "Minimum-Maximum principle" to obtain an effective partition of the training data stably and efficiently, while retaining vKNN's ability to dynamically match the optimal Relevant-partition.

Experimental results show that the proposed methods execute efficiently in a data-intensive environment and scale well while preserving accuracy. In addition, the parallel nearest-neighbor method based on stable clustering achieves higher partitioning efficiency and stability.
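The abstract does not spell out the thesis's prediction strategy, so the following is only a minimal single-machine sketch of the general idea behind Voronoi-based partition selection, not the vKNN algorithm itself. It assumes Euclidean distance, randomly chosen pivots, and a standard triangle-inequality bound for skipping partitions; the names `build_cells` and `knn_query` are hypothetical and do not come from the thesis.

```python
import heapq
import math
import random


def build_cells(train, pivots):
    """Assign each training point to the Voronoi cell of its nearest pivot.

    Also records each cell's radius (largest pivot-to-member distance),
    which is used later to bound how close any member can be to a query.
    """
    cells = [[] for _ in pivots]
    for p in train:
        nearest = min(range(len(pivots)), key=lambda j: math.dist(p, pivots[j]))
        cells[nearest].append(p)
    radii = [
        max((math.dist(pv, p) for p in cell), default=0.0)
        for pv, cell in zip(pivots, cells)
    ]
    return cells, radii


def knn_query(q, k, pivots, cells, radii):
    """Return the k nearest training points to q and the number of cells scanned.

    Cells are visited from the nearest pivot outward. A cell is skipped when
    the triangle-inequality lower bound dist(q, pivot) - radius is already
    no smaller than the current k-th best distance (assumed pruning rule).
    """
    order = sorted(range(len(pivots)), key=lambda j: math.dist(q, pivots[j]))
    heap = []  # max-heap on distance, stored as (-distance, point)
    visited = 0
    for j in order:
        if not cells[j]:
            continue
        lower = max(0.0, math.dist(q, pivots[j]) - radii[j])
        if len(heap) == k and lower >= -heap[0][0]:
            continue  # no point in this cell can improve the result
        visited += 1
        for p in cells[j]:
            d = math.dist(q, p)
            if len(heap) < k:
                heapq.heappush(heap, (-d, p))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, p))
    return sorted((-nd, p) for nd, p in heap), visited


if __name__ == "__main__":
    rng = random.Random(0)
    train = [(rng.random(), rng.random()) for _ in range(500)]
    pivots = rng.sample(train, 8)  # random pivots, purely for illustration
    cells, radii = build_cells(train, pivots)
    neighbours, visited = knn_query((0.3, 0.7), 5, pivots, cells, radii)
    print(f"scanned {visited} of {len(pivots)} cells")
    for d, p in neighbours:
        print(f"{d:.4f}  {p}")
```

In a MapReduce setting the cell assignment would roughly correspond to the map phase and the per-cell scan to the reduce phase. Because the pivots here are drawn at random, cell sizes can vary from run to run, which is the kind of instability the clustering-based partitioning in contribution (2) is meant to address.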

Keywords/Search Tags:

Large-scale data, MapReduce, Partition selection, KNN Join, Canopy-kmeans

Related items

1	Research And Design Of KNN-join Algorithm Based On MapReduce
2	Join Method Research Based On MapReduce
3	Research And Implementation Of A Hybird Recommendation System Based On Auto Encoder And Canopy-Kmeans Algorithm
4	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
5	Research On String Similarity Join Method Based On Hadoop Platform
6	Research On Distributed Spatial Join Algorithms For Large Scale Data
7	Research And Implementation Of Multi-plex Iteration Based On MapReduce
8	The Research Of Clustering Mining Based On Logistics History Data On The Hadoop
9	Research On Key Techniques Of High Performance Spatial Query Processing For Large Scale Spatial Data
10	Design And Implementation Of Similarity Self - Connection Algorithm For Massive Data Sets Based On MapReduce