Study On The Data Driven Parallel Incremental SVM Learning Algorithm Based On Hadoop Framework

Posted on: 2018-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: W J Pi
GTID: 2348330542462808
Subject: Software engineering
Abstract/Summary:
We are in the era of big data, and data volumes are growing far faster than the computing power of a single machine. Improving the ability of classification algorithms to handle massive data is therefore an urgent problem. Among classification algorithms, the support vector machine (SVM) has become a mainstream choice because of its robustness and stability. Based on the principle of structural risk minimization from statistical learning theory, SVM effectively overcomes the curse of dimensionality that classical statistical methods face on high-dimensional data. However, SVM is computationally intensive, and traditional serial computation is not suitable for processing massive data. For large-scale training data, how to improve the training efficiency and incremental learning ability of SVM while preserving classification accuracy has become a hot research topic in recent years. To address this problem, researchers have combined traditional data mining algorithms with cloud computing platforms, using distributed computing power to improve algorithm performance, and have achieved some good results.

This body of work shows that traditional parallel processing methods such as MPI and grid computing suffer from complex development and poor scalability, so using cloud computing to improve the efficiency of SVM has gradually become a research focus. Hadoop, a major cloud computing platform, is efficient and simple to deploy. The standard MapReduce parallel model for a classification algorithm obtains the final classification model through a single pass of Map and Reduce operations over the data set. When the number of training samples is large, however, the data must be divided into a large number of splits, and even if computing resources are adequate, too many scheduling and communication operations significantly reduce the performance of the algorithm. In addition, an incremental learning strategy not only lets the algorithm adapt to dynamic data but also reduces hardware requirements when increments of an appropriate size are selected.

However, the traditional MapReduce model may not suit data mining scenarios that require iterative computation, such as incremental learning. First, repeatedly loading the static training data consumes substantial network resources. Second, the runtime environment must be initialized again each time an iteration begins. Third, a large number of iterations produces many key-value pairs in the shuffle stage, resulting in network congestion.

To address the difficulty that traditional SVM algorithms have in processing huge training data sets, an efficient data-driven parallel incremental Adaboost-SVM learning algorithm (PIASVM) based on Hadoop is proposed. An ensemble system has each classifier process one partition of the data and then integrates their results into a combined classifier. Weights describe the spatial distribution properties of the samples and are iteratively updated during the incremental training stage, and forgetting factors are applied to select new samples and eliminate historical samples. In addition, a Controller component based on HBase is designed to schedule the iterative procedure; it persists intermediate results and reduces the bandwidth pressure of iterative MapReduce. Experimental results on multiple data sets demonstrate that the proposed algorithm achieves good speedup, sizeup, and scaleup, and effectively improves the capacity to process large-scale data while maintaining high accuracy.
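The abstract does not give the update rules used by PIASVM, so the following is only a minimal Python sketch of the two ideas it names: forgetting factors that retire historical samples, and a weighted combination of per-partition classifiers. Everything here is an assumption made for illustration: the function names, the multiplicative decay, the pruning threshold `min_weight`, and the use of the textbook AdaBoost vote weight. None of it is taken from the thesis.

```python
import math


def decay_and_prune(sample_weights, forgetting_factor=0.8, min_weight=0.3):
    """Age historical samples and drop those whose weight falls too low.

    Hypothetical rule: every retained sample's weight is multiplied by
    forgetting_factor once per incremental round; samples whose weight
    drops below min_weight are eliminated from the training set.
    """
    aged = {sid: w * forgetting_factor for sid, w in sample_weights.items()}
    return {sid: w for sid, w in aged.items() if w >= min_weight}


def add_new_samples(sample_weights, new_ids, initial_weight=1.0):
    """Admit a new increment of samples at full weight (assumed policy)."""
    merged = dict(sample_weights)
    for sid in new_ids:
        merged[sid] = initial_weight
    return merged


def classifier_vote_weight(error_rate):
    """Textbook AdaBoost vote weight: 0.5 * ln((1 - err) / err)."""
    eps = 1e-10  # keep the logarithm finite when err is 0 or 1
    err = min(max(error_rate, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - err) / err)


def weighted_vote(predictions, alphas):
    """Combine +1/-1 predictions from the partition classifiers."""
    score = sum(a * p for p, a in zip(predictions, alphas))
    return 1 if score >= 0 else -1


# One incremental round: age the old samples, then admit the new batch.
weights = {"a": 1.0, "b": 0.3}
weights = decay_and_prune(weights)          # "b" falls below 0.3 and is dropped
weights = add_new_samples(weights, ["c"])   # "c" enters at weight 1.0

# Combine three partition classifiers with different training error rates.
alphas = [classifier_vote_weight(e) for e in (0.1, 0.4, 0.45)]
label = weighted_vote([1, -1, -1], alphas)
```

In this sketch the most accurate classifier outvotes the two weaker ones, which is the usual motivation for weighting ensemble members by their error rate. In a Hadoop setting each partition classifier would typically be trained in its own Map task, which is not modeled here.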
Keywords/Search Tags: Hadoop, HBase, SVM, incremental learning, ensemble learning, forgetting factor, Controller component