Large Data Sets Sample Selection Based On Map Reduce

Posted on:2016-06-17

Degree:Master

Type:Thesis

Country:China

Candidate:X H Pang

Full Text:PDF

GTID:2308330479977635

Subject:Computer technology

Abstract/Summary:

With the rapid development of data storage, computer networking, and cloud computing technologies, the volume of data has grown rapidly, and big data processing has become a problem that both academia and industry pay close attention to. Discovering useful knowledge from big data is a new challenge for traditional data mining algorithms, so investigating sample selection from large data sets is meaningful. Based on Map Reduce, this thesis proposes a sample selection algorithm. It first uses the Map mechanism of Map Reduce to partition the large data set into small subsets and deploy them to different cloud computing nodes, where the informative samples are selected in parallel with an instance selection algorithm. The Reduce mechanism of Map Reduce then collects the selected samples from the different nodes, yielding one selected sample subset. This process is repeated k times (k is a user-defined parameter), producing k sample subsets. Finally, a voting method selects the most informative samples from the k subsets. An ELM (extreme learning machine) classifier is trained with the selected samples, and its accuracy is measured on the testing set. The proposed algorithm is compared experimentally with classic sample selection algorithms; the results show that it is effective and efficient.
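
The partition, select, collect, and vote steps described in the abstract can be sketched in a single process as follows. This is only an illustration under stated assumptions, not the thesis implementation: the map and reduce phases are simulated with plain Python lists instead of a real Map Reduce cluster, Hart's condensed nearest neighbor rule stands in for the unspecified instance selection algorithm, and the voting rule (keep a sample chosen in at least `min_votes` of the k rounds) is a guess at what "voting" means here. The names `select_samples`, `condensed_nn`, `n_partitions`, and `min_votes` are hypothetical.

```python
import random
from collections import Counter


def sq_dist(u, v):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def condensed_nn(subset):
    """Stand-in instance selector (Hart's condensed nearest neighbor).

    subset: list of (index, features, label) triples.
    Returns the indices of the retained samples.
    """
    if not subset:
        return []
    store = [subset[0]]
    kept = {subset[0][0]}
    changed = True
    while changed:
        changed = False
        for item in subset:
            if item[0] in kept:
                continue
            nearest = min(store, key=lambda s: sq_dist(s[1], item[1]))
            # Keep a sample only if the current store misclassifies it.
            if nearest[2] != item[2]:
                store.append(item)
                kept.add(item[0])
                changed = True
    return [idx for idx, _, _ in store]


def select_samples(data, labels, k=5, n_partitions=4, min_votes=3, seed=0):
    """Repeat partition -> parallel selection -> collection k times, then vote."""
    rng = random.Random(seed)
    votes = Counter()
    indices = list(range(len(data)))
    for _ in range(k):
        # "Map": randomly partition the data; each part would go to one node.
        rng.shuffle(indices)
        partitions = [indices[i::n_partitions] for i in range(n_partitions)]
        selected_per_node = [
            condensed_nn([(i, data[i], labels[i]) for i in part])
            for part in partitions
        ]
        # "Reduce": collect the selections from all nodes into one subset.
        round_subset = [i for sel in selected_per_node for i in sel]
        votes.update(round_subset)
    # Voting (assumed rule): keep samples selected in at least min_votes rounds.
    return sorted(i for i, count in votes.items() if count >= min_votes)
```

Calling `select_samples(data, labels, k=5, min_votes=3)` on a list of feature tuples and a parallel list of labels returns the sorted indices of the retained samples, which could then be used to train a classifier such as an ELM. Re-shuffling before each round means every round sees a different partition, which is what gives the vote something to aggregate.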

Keywords/Search Tags:

Large data sets, Cloud computing, Sample selection, Map Reduce, Extreme learning machine

Related items

1	The Classification Of Imbalanced Large Data Sets Based On Map Reduce
2	Ensemble Of OS-ELM For Large Data Sets Classification
3	Research And Implementation Of XML Document Classification Based On Extreme Learning Machine In Cloud Environment
4	Studies On Performance Optimization Techniques For Big Data Learning Based On Cloud Computing
5	Research On Binary Imbalanced Large Data Classification And Its Application
6	Support Vector Machine Based On Boundary Sample Selection
7	Research On Extreme Learning Machine Under The Cloud Environment
8	Research On Indoor Positioning Algorithm Based On Extreme Learning Machine In Incomplete Data Sets
9	Cloud-based Platform For The Large-scale Manifold Learning Algorithm Research
10	Distributed Machine Learning With Adaptive Sample Selection