Research On Big Data Sample Selection Based On MapReduce/Spark

Posted on:2021-03-16

Degree:Master

Type:Thesis

Country:China

Candidate:D D Song

Full Text:PDF

GTID:2428330620470570

Subject:Software engineering

Abstract/Summary:

In recent years, big data has become a very active research topic. In big data scenarios, traditional machine learning algorithms encounter great difficulties and challenges. How to overcome them and extend traditional machine learning algorithms to big data environments is of significant research value. Sample selection is a feasible solution: it selects important samples from a big data set and removes unimportant, redundant, and noisy samples. This thesis studies the problem of sample selection from big data. The main work consists of the following four parts:

1. Inspired by the divide-and-conquer strategy and the idea of cross-validation, a sample selection framework based on MapReduce/Spark is proposed. The basic idea of the framework is to partition the big data set into several subsets. When selecting samples from a given subset, a committee composed of classifiers trained on the other subsets is used to evaluate the importance of the samples in that subset, and important samples are selected in parallel. Under this framework, two sample selection algorithms are proposed:

(1) A big data crossover sample selection algorithm based on MapReduce/Spark and voting entropy. The algorithm uses voting entropy to measure the importance of the samples in each data subset, selects important samples from the local data subsets on multiple cloud computing nodes, and is implemented with both MapReduce and Spark.

(2) A big data crossover sample selection algorithm based on MapReduce/Spark and a genetic algorithm. The algorithm encodes the sample subset in binary, uses the average information entropy of the samples in the subset as the fitness function, and conducts crossover sample selection in an evolutionary manner using the MapReduce/Spark computing framework.

2. A big data sample selection algorithm based on MapReduce/Spark and locality sensitive hashing is also proposed. The basic idea of this algorithm is to partition the big data set into several subsets and deploy them to different cloud computing nodes. On each node, a locality sensitive hashing transformation is applied to the local data subset using the MapReduce/Spark computing framework, samples with the same hash code are put into the same bucket, and samples are selected from each bucket in a certain proportion.

3. The three proposed algorithms are tested and compared with existing big data sample selection algorithms in terms of sample selection quality, compression ratio, and running time. The experiments verify the feasibility and efficiency of the proposed algorithms on the same big data platform.

4. The three proposed algorithms are compared on two different big data platforms, using sample selection quality, compression ratio, running time, and number of synchronizations as experimental indicators. Some valuable conclusions are drawn, which may help those engaged in related research.
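
The committee-based voting-entropy criterion described in the abstract can be illustrated with a small single-machine sketch. The abstract does not give the exact formula, threshold, or classifier type, so everything here is an assumption: the entropy is the standard Shannon entropy of the committee's vote distribution, the names `vote_entropy` and `select_by_vote_entropy` are hypothetical, the committee is a set of toy threshold classifiers, and the selection rule (keep samples whose entropy reaches a fixed threshold) is only one plausible choice.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy of the class labels voted by a committee for one sample.

    Assumption: the thesis may use a different or normalized variant.
    """
    k = len(votes)
    counts = Counter(votes)
    return -sum((c / k) * math.log(c / k) for c in counts.values())


def select_by_vote_entropy(samples, committee, threshold):
    """Keep the samples on which the committee disagrees enough.

    `committee` is a list of callables mapping a sample to a class label.
    In the cross-selection setting, these would be classifiers trained on
    the *other* subsets, never on the subset being filtered.
    """
    selected = []
    for x in samples:
        votes = [clf(x) for clf in committee]
        if vote_entropy(votes) >= threshold:
            selected.append(x)
    return selected


# Toy committee (hypothetical): three 1-D threshold classifiers.
committee = [lambda x, t=t: int(x > t) for t in (0.4, 0.5, 0.6)]
subset = [0.1, 0.45, 0.55, 0.9]
important = select_by_vote_entropy(subset, committee, threshold=0.5)
```

In this toy run the committee agrees unanimously on 0.1 and 0.9 (entropy 0), and splits 2-to-1 on 0.45 and 0.55, so only the two samples near the decision boundary are kept. In a MapReduce or Spark setting, each partition would plausibly run this filter locally against the committee built from the remaining partitions.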
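
The locality sensitive hashing step described in the abstract (hash each sample, group samples with the same hash code into a bucket, then draw a fixed proportion from each bucket) can be sketched as follows. The abstract does not say which hash family is used, so this sketch assumes random-hyperplane (sign of random projection) hashing. The function names, the `ratio` parameter, and the rule of keeping at least one sample per bucket are all hypothetical.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, rng):
    """Draw n_bits random hyperplanes; each contributes one bit of the hash code."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(x, planes):
    """Bit i is 1 if x lies on the non-negative side of hyperplane i."""
    return tuple(
        int(sum(w * v for w, v in zip(plane, x)) >= 0.0) for plane in planes
    )


def lsh_select(samples, planes, ratio, rng):
    """Bucket samples by hash code, then sample a proportion from each bucket.

    Assumption: at least one sample is kept per non-empty bucket.
    """
    buckets = defaultdict(list)
    for x in samples:
        buckets[hash_code(x, planes)].append(x)
    selected = []
    for members in buckets.values():
        k = max(1, math.ceil(ratio * len(members)))
        selected.extend(rng.sample(members, k))
    return selected


rng = random.Random(0)
planes = make_hyperplanes(dim=2, n_bits=4, rng=rng)
data = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
picked = lsh_select(data, planes, ratio=0.25, rng=rng)
```

In a distributed version, the hash code would naturally serve as the key emitted by the map phase, with the per-bucket sampling done in the reduce phase (or via a group-by-key in Spark). That mapping is an interpretation of the abstract, not a detail it states.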

Keywords/Search Tags:

Big data, Sample selection, Vote entropy, Genetic algorithm, Locality sensitive hashing

Related items

1	Research Of Approximate K-Nearest Neighbors Search Algorithm Based On Locality Sensitive Hashing
2	Research Of DBSCAN Algorithm Based On Locality Sensitive Hashing Method
3	Clustering And Locality Sensitive Hashing Algorithms On Text Stream Data Under Classification-oriented Measure
4	Research On Integrated Algorithm Of Locality Sensitive Hashing And Matrix Factorization On GPU Platform
5	Working Towards Performance Analysis Of Locality Sensitive Hashing
6	Locality Sensitive Hashing Index Based On Neighborhood Collision Counting
7	Research On Similarity Image Retrieval Based On Locality Sensitive Hashing And Structured P2P Network
8	Research On DBSCAN Clustering Algorithm Based On Locality Sensitive Hashing
9	Towards Performance Analysis Of Locality Sensitive Hashing
10	Research About Image Retrieval Based On Hashing Technology