Research On Big Data Sample Selection Based On MapReduce/Spark

Posted on: 2021-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: D D Song
Full Text: PDF
GTID: 2428330620470570
Subject: Software engineering
Abstract/Summary:
In recent years, big data has become a very active research topic. In big data scenarios, traditional machine learning algorithms encounter great difficulties and challenges. How to overcome these difficulties and extend traditional machine learning algorithms to the big data environment has important research value and significance. Sample selection is one feasible solution: it selects important samples from a big data set and removes unimportant, redundant, and noisy samples. This thesis studies the problem of sample selection from big data. The main work consists of the following four parts:
1. Inspired by the divide-and-conquer strategy and the idea of cross-validation, a sample selection framework based on MapReduce/Spark is proposed. The basic idea of this framework is to partition the big data set into several subsets. When selecting samples from a given subset, a committee composed of classifiers trained on the other subsets is used to evaluate the importance of the samples in that subset, and important samples are selected in parallel. Under this framework, two sample selection algorithms are proposed:
(1) A big data crossover sample selection algorithm based on MapReduce/Spark and voting entropy. The algorithm uses voting entropy to measure the importance of the samples in each data subset and selects important samples from the local data subsets on multiple cloud computing nodes. It is implemented with both MapReduce and Spark.
(2) A big data crossover sample selection algorithm based on MapReduce/Spark and a genetic algorithm. This algorithm encodes the sample subset in binary, uses the average information entropy of the samples in the subset as the fitness function, and conducts cross sample selection in an evolutionary manner using the MapReduce/Spark computing framework.
2. A big data sample selection algorithm based on MapReduce/Spark and locality sensitive hashing is also proposed. The basic idea of this algorithm is to partition the big data set into several subsets and deploy them to different cloud computing nodes. On each node, the locality sensitive hashing transformation is applied to the local data subset using the MapReduce/Spark computing framework, samples with the same hash code are placed in the same bucket, and samples are selected from each bucket in a certain proportion.
3. The three proposed big data sample selection algorithms were tested and compared with existing big data sample selection algorithms in terms of sample selection quality, compression ratio, and running time. The experiments verified the feasibility and efficiency of the proposed algorithms on the same big data platform.
4. The three proposed algorithms were compared on two different big data platforms, using sample selection quality, compression ratio, running time, and number of synchronizations as experimental indicators. Some valuable conclusions were obtained, which may be helpful to those engaged in related research.
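The committee-based cross selection with voting entropy described in part 1 can be illustrated with a minimal single-machine sketch. It is not the thesis implementation: the nearest-centroid committee member, the function names, the `threshold` parameter, and the rule "keep a sample when its vote entropy exceeds the threshold" are all assumptions made for illustration, and the per-subset loop stands in for work that would run in parallel on MapReduce/Spark nodes.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (natural log) of the committee's label votes for one sample."""
    total = len(votes)
    entropy = 0.0
    for count in Counter(votes).values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


def train_centroids(subset):
    """Hypothetical committee member: the mean feature vector of each class."""
    sums, counts = {}, Counter()
    for x, y in subset:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the label of the nearest class centroid (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def cross_select(subsets, threshold):
    """Keep samples on which the committee from the other subsets disagrees.

    Assumption: a sample counts as important when its vote entropy is
    strictly greater than `threshold`.
    """
    models = [train_centroids(s) for s in subsets]
    selected = []
    for k, subset in enumerate(subsets):
        # The committee for subset k excludes the model trained on subset k.
        committee = [m for j, m in enumerate(models) if j != k]
        for x, y in subset:
            votes = [predict(m, x) for m in committee]
            if vote_entropy(votes) > threshold:
                selected.append((x, y))
    return selected


# Three toy one-dimensional subsets of (features, label) pairs.
subsets = [
    [([0.0], "a"), ([2.5], "b"), ([6.0], "b")],
    [([0.0], "a"), ([4.0], "b")],
    [([1.0], "a"), ([5.0], "b")],
]
selected = cross_select(subsets, threshold=0.5)
```

In this toy data only the sample at 2.5 lies between the decision boundaries of the two classifiers that vote on it, so it is the only one on which the committee disagrees. In a Spark setting, the loop over subsets would plausibly map onto per-partition processing, with the trained committee members shared across nodes.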
Keywords/Search Tags: Big data, Sample selection, Vote entropy, Genetic algorithm, Locality sensitive hashing