Research On The Design Of Spark-accelerated Boosting By Majority Voting Algorithm

Posted on: 2019-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
GTID: 2428330548463425
Subject: Software engineering
Abstract/Summary:
In the age of big data, data mining technology has been further studied and improved, and distributed algorithms and online learning algorithms have come into use. Distributed storage and computing not only allow massive data to be stored reliably but also greatly improve computational efficiency. As a result, the extension of data mining algorithms to distributed platforms has developed rapidly, for example Hadoop's Mahout machine learning component and Spark's MLlib machine learning component. An online learning algorithm receives a single record or a small batch of data in real time for model training, which saves memory, computes quickly, and suits data mining over massive data.

Among the classification tasks of data mining, classification algorithms based on boosting have emerged. The BBM (Boosting By Majority) algorithm proposed by Freund in 1995 has good classification performance. However, the algorithm is suited to small data sets on a single machine. When the amount of data grows explosively, the traditional BBM algorithm can no longer meet the requirements.

This thesis makes two main contributions. First, based on the BBM algorithm, Spark data processing technology is used to propose a distributed BBM algorithm for batch data processing, which we call BBM.Spark. Second, based on the Online BBM algorithm, distributed stream processing on Spark Streaming is used to propose a new algorithm for stream data processing, which we call BBM.Streaming.

The experiments on BBM.Spark and BBM.Streaming are carried out using 10 datasets. The BBM.Spark experiments study the influence of the number of weak classifiers on the classification performance of the algorithm. They also investigate the effects of the parameters num-executors, executor-cores, executor-memory and partitions on the operating efficiency of the algorithm. The experimental results show that as the number of weak classifiers increases, the classification accuracy also increases but tends to stabilize. When the other three parameters do not exceed the resource limit, operating efficiency increases as the parameter values increase.

The BBM.Streaming experiments also study the influence of the number of weak classifiers on classification accuracy, and compare the classification performance with that of Online BBM and VFDT (the very fast decision tree algorithm). In addition, they study the influence of the parameters Partitions, Durations and MaxRatePerPartition on operating efficiency, and compare the running time efficiency with that of Online BBM and VFDT. The test results show that Online BBM, BBM.Streaming and VFDT are almost the same in terms of classification accuracy. In terms of operating efficiency, BBM.Streaming < Online BBM < VFDT on small datasets and BBM.Streaming > Online BBM > VFDT on large datasets.

Studying boosting for big data and data streams is therefore of great significance. Distributed processing of big data makes full use of computing resources, gives boosting stronger classification performance on big data, and can greatly speed up the algorithms through a distributed processing platform.
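As an illustration of the majority-vote step that gives BBM its name, the following minimal Python sketch combines several weak classifiers by an unweighted vote over labels in {-1, +1}. It is not the BBM.Spark or BBM.Streaming implementation from the thesis: the names `make_stump`, `majority_vote` and `accuracy`, the threshold-rule weak learner, the tie-breaking rule, and the toy data are all assumptions made for this example, and the BBM training procedure that selects the weak classifiers is omitted.

```python
from typing import Callable, List, Sequence, Tuple

# A weak classifier maps a feature vector to a label in {-1, +1}.
WeakClassifier = Callable[[Sequence[float]], int]


def make_stump(feature: int, threshold: float) -> WeakClassifier:
    """Hypothetical weak learner: a one-feature threshold rule."""
    def stump(x: Sequence[float]) -> int:
        return 1 if x[feature] > threshold else -1
    return stump


def majority_vote(classifiers: List[WeakClassifier], x: Sequence[float]) -> int:
    """Unweighted majority vote over the weak classifiers.

    Ties are broken towards +1, which is an assumption of this sketch.
    """
    total = sum(h(x) for h in classifiers)
    return 1 if total >= 0 else -1


def accuracy(classifiers: List[WeakClassifier],
             data: List[Tuple[Sequence[float], int]]) -> float:
    """Fraction of labelled examples the voted ensemble classifies correctly."""
    correct = sum(1 for x, y in data if majority_vote(classifiers, x) == y)
    return correct / len(data)


if __name__ == "__main__":
    # Toy ensemble and toy data, invented for illustration only.
    ensemble = [make_stump(0, 0.5), make_stump(1, 0.5), make_stump(0, 0.2)]
    sample = [((0.9, 0.8), 1), ((0.1, 0.1), -1), ((0.3, 0.9), 1)]
    print(accuracy(ensemble, sample))
```

Using an odd number of weak classifiers avoids ties altogether, so the tie-breaking assumption only matters for even-sized ensembles.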
Keywords/Search Tags:Data mining, Boosting, Majority vote, Spark, Spark Streaming