
Studies On Performance Optimization Techniques For Big Data Learning Based On Cloud Computing

Posted on: 2017-07-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Huang
Full Text: PDF
GTID: 1318330542989651
Subject: Computer software and theory
Abstract/Summary:
The development of the Internet has made life more convenient, and at the same time it has generated data at a very large scale. The era of big data learning has arrived, and finding the information people are interested in and helping them make decisions has become a challenging problem for both academia and industry. Big data learning raises several challenges: eliminating the influence of noisy data, training multiple models, training models efficiently, and extracting samples efficiently. This dissertation addresses two research issues: optimizing the performance of model training and optimizing the performance of sample extraction. For model training, it proposes one method that trains an ensemble of online sequential extreme learning machines in parallel and one method that trains multiple online sequential extreme learning machine models in parallel. For sample extraction, it proposes one method that extracts samples from a massive multi-dimensional data source whose data are updated frequently and one method that extracts samples from a massive multi-dimensional data source that users may access concurrently. The main contributions of this dissertation are summarized as follows:

(1) To address the challenges of eliminating the influence of noisy data and training models efficiently in big data learning, this dissertation proposes a framework for training an ensemble of online sequential extreme learning machines (EOS-ELM). The framework supports any combination of three ensemble methods: Bagging, subspace partitioning and cross validation. Based on EOS-ELM, this dissertation proposes a parallel ensemble of online sequential extreme learning machines (PEOS-ELM), which is suitable for learning from large-scale data accurately and efficiently. PEOS-ELM is evaluated with real and synthetic data. The experimental results show that the speedup of PEOS-ELM reaches as high as 40x on a cluster with up to 80 CPU cores and that its accuracy is on par with that of EOS-ELM. The results also show that both EOS-ELM and PEOS-ELM are more accurate than OS-ELM.

(2) To address the challenge of training multiple models efficiently in big data learning, this dissertation proposes BPOS-ELM, a method that trains multiple online sequential extreme learning machine (OS-ELM) models in parallel. BPOS-ELM uses a single MapReduce job to train multiple OS-ELM models. It predicts Map execution time and Reduce execution time from historical statistics. Two methods are used to predict Map execution time: one is based on regression, and the other is based on k nearest neighbors with inverse distance weighted interpolation. Reduce execution time is predicted with a method based on complexity analysis and regression. BPOS-ELM uses a greedy algorithm to generate the execution plan and collects execution information to further improve the accuracy of its estimates of Map and Reduce execution time. BPOS-ELM is evaluated with real and synthetic data. The experimental results show that its speedup reaches as high as 10x on a cluster with up to 32 CPU cores.

(3) To address the challenge of extracting samples efficiently from a massive multi-dimensional data source, this dissertation proposes a method that extracts samples from such a source when its data are updated frequently. The method designs and implements an efficient index based on R-tree and HBase (R-HBase). R-HBase uses an R-tree to index grids and supports several kinds of space-filling curves, such as Z-order and Hilbert. Based on this index, the dissertation proposes algorithms for data insertion and sample extraction. The method is evaluated with synthetic data. The experimental results show that its insertion throughput reaches as high as five thousand insertions per second and that sample extraction is fast.

(4) To address the challenge of extracting samples efficiently from a massive multi-dimensional data source that multiple users may read or write concurrently, this dissertation proposes a method that extracts samples under concurrent read and write access. The method designs and implements an index based on R-tree and HBase, named HMVR-tree. The index provides synchronization mechanisms to support concurrent read and write access. Based on this index, the dissertation proposes algorithms for data insertion and sample extraction. HMVR-tree is evaluated with synthetic data. The experimental results show that HMVR-tree scales well.
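
As an illustration of the second Map-time predictor mentioned in contribution (2), the following is a minimal sketch of k-nearest-neighbor prediction with inverse distance weighted interpolation. It is not the dissertation's implementation: the function name `idw_knn_predict`, the choice of features (for example, input size and number of hidden nodes), the Euclidean distance, and the `power` parameter are all assumptions made for this example.

```python
import math


def idw_knn_predict(history, query, k=3, power=2.0):
    """Estimate an execution time at `query` from the k nearest past runs.

    history: list of (feature_vector, observed_time) pairs (hypothetical format)
    query:   feature vector with the same dimensionality as those in history
    """
    if not history:
        raise ValueError("history must not be empty")

    # Euclidean distance from the query to every historical run, nearest first.
    scored = sorted(
        (math.dist(features, query), observed) for features, observed in history
    )
    nearest = scored[:k]

    # An exact match would get an infinite weight, so return it directly.
    if nearest[0][0] == 0.0:
        return nearest[0][1]

    # Closer neighbors get larger weights: w = 1 / d^power.
    weights = [1.0 / (d ** power) for d, _ in nearest]
    weighted_sum = sum(w * t for w, (_, t) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Hypothetical history: (input size, hidden nodes) -> observed Map time.
past_runs = [
    ((1.0, 1.0), 10.0),
    ((2.0, 2.0), 20.0),
    ((10.0, 10.0), 100.0),
]
estimate = idw_knn_predict(past_runs, (1.5, 1.5), k=2)
```

In this sketch the query lies midway between the two nearest runs, so both receive the same weight and the estimate is their plain average. Feeding newly observed execution times back into `history` corresponds loosely to the abstract's statement that collected execution information is used to improve later estimates.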
Keywords/Search Tags: big data learning, extreme learning machine, cloud computing, MapReduce, HBase