Big Data Active Learning Based On Open Source Frameworks

Posted on: 2019-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: C Wang
GTID: 2428330566965493
Subject: Master of Engineering - Software Engineering
Abstract/Summary:
With the rapid development of society and the progress of science and technology, massive amounts of data are produced every day, and most of it is unlabeled, such as web, audio, and video data. Labeling, however, is usually difficult, time-consuming, and labor-intensive. Active learning is an efficient method for addressing this problem: it iteratively selects important samples from the unlabeled data for labeling by an oracle, adds the labeled samples to the training set, and retrains the classifier on the updated training set until the classification accuracy meets a predefined requirement. The advent of the big data era has introduced many challenges for traditional active learning algorithms, so studying big data active learning is of great significance and value.

This thesis investigates big data active learning based on open source frameworks, focusing on the scalability of traditional active learning algorithms in big data scenarios and on their implementation with open source frameworks such as Hadoop and Spark. In big data active learning, "big data" refers to the unlabeled data, while the labeled data is a small or medium-sized data set. Accordingly, the goal of big data active learning is to select only the most valuable samples from the unlabeled data, at the smallest cost, for labeling by oracles. Specifically, this thesis uses uncertainty as the criterion for selecting informative samples and the extreme learning machine as the classifier, and studies the two most popular open source frameworks, Hadoop and Spark. On this basis, two algorithms are proposed for active learning in a big data environment. In the Hadoop-based study, we mainly studied the implementation of big data active learning with MapReduce, so that traditional active learning algorithms run in parallel on the Hadoop platform. In the Spark-based study, we studied the implementation of big data active learning algorithms with Spark's RDD operations, processing the big data iteratively through in-memory computation on Spark clusters. In addition, we present an experimental comparison of big data active learning on the two open source frameworks (i.e., Hadoop and Spark) and obtain some valuable conclusions that can be helpful to researchers in related fields.
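The uncertainty-based selection step described above can be illustrated with a small sketch. This is not the thesis's algorithm. It assumes entropy as the uncertainty measure (the abstract does not say which measure is used), plain Python lists standing in for Hadoop or Spark partitions, and a hypothetical `predict_proba` callable standing in for the trained extreme learning machine. The names `map_top_k`, `reduce_top_k`, and `select_for_labeling` are illustrative only.

```python
import heapq
import math
from functools import reduce


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def map_top_k(partition, predict_proba, k):
    """Map step: score one partition and keep its k most uncertain samples.

    `partition` is a list of (sample_id, features) pairs.
    """
    scored = ((entropy(predict_proba(x)), idx) for idx, x in partition)
    return heapq.nlargest(k, scored)


def reduce_top_k(left, right, k):
    """Reduce step: merge two local top-k lists into one top-k list."""
    return heapq.nlargest(k, left + right)


def select_for_labeling(partitions, predict_proba, k):
    """Return the ids of the k most uncertain samples across all partitions."""
    local = [map_top_k(p, predict_proba, k) for p in partitions]
    merged = reduce(lambda a, b: reduce_top_k(a, b, k), local, [])
    return [idx for _, idx in merged]


# Toy usage: each "feature" is already the probability of class 1
# (an assumption made only to keep the example self-contained).
def toy_proba(x):
    return [1.0 - x, x]


partitions = [
    [(0, 0.05), (1, 0.50), (2, 0.90)],
    [(3, 0.45), (4, 0.99), (5, 0.70)],
]
print(select_for_labeling(partitions, toy_proba, 2))  # prints [1, 3]
```

Keeping only a local top-k per partition means each worker passes at most k candidates to the merge step, which is one plausible way to limit data movement when the unlabeled pool is large. The selected samples would then go to the oracle for labeling before the classifier is retrained.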
Keywords/Search Tags:Big data, Active learning, Open source framework, Extreme learning machine, Sample selection