
Research on Distributed Support Vector Machine (SVM) Based on the Hadoop Cloud Platform

Posted on: 2015-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: K Niu
Full Text: PDF
GTID: 2268330428462822
Subject: Computer application technology
Abstract/Summary:
The support vector machine (SVM) is a machine learning method based on statistical learning theory, proposed by Vapnik et al. Built on the VC-dimension theory and the principle of structural risk minimization, SVM performs well as a classification method on small-sample, nonlinear, and high-dimensional data sets and on pattern recognition problems. It has therefore attracted growing attention from experts and scholars in many fields and has become a powerful tool for classification and regression in data mining. However, as data sets grow in size, training an SVM to find the globally optimal support vectors becomes slow and consumes substantial computing resources; on very large data sets a training model may not be obtainable within an acceptable time under practical conditions. The emergence of cloud computing offers a way forward for massive data mining.
The powerful storage capacity of a distributed file system on a cloud platform, together with the parallelization of traditional data mining algorithms, provides a good opportunity for the development of massive data mining technology. This paper examines in depth the Hadoop Distributed File System (HDFS) and the MapReduce distributed programming framework of Hadoop, currently the most popular cloud platform, along with the inner working mechanisms of the MapReduce computing framework, and builds a fully distributed Hadoop cluster on Hadoop-1.0.0 in a Linux environment. Relying on HDFS, the Hadoop cloud platform stores large data sets in blocks. By reading the dfs.block.size property in the hdfs-site.xml configuration file, this paper splits the data set into blocks of fixed capacity, and a parallel SVM based on the MapReduce programming framework then trains the allocated blocks on the DataNodes. In the traditional SVM training process, parameter settings depend mainly on experience.
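The per-block training and combination step can be sketched as follows. This is a minimal, self-contained illustration, not the thesis's implementation: it assumes a cascade-style combination (each "map" task trains an SVM on one data block and keeps its support vectors; the "reduce" step retrains on the pooled support vectors), and it uses a simple Pegasos-style linear SVM in place of a full kernel SVM. All function names are illustrative.

```python
import random

def train_linear_svm(data, lam=0.01, epochs=50):
    """Pegasos-style SGD for a linear SVM; returns the weight vector."""
    dim = len(data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # regularization shrink, then a hinge-loss step if inside the margin
            w = [(1 - eta * lam) * wi for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def support_vectors(data, w, tol=1.0):
    """Points on or inside the margin: y * <w, x> <= tol."""
    return [(x, y) for x, y in data
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= tol]

def cascade_svm(blocks):
    """Map: train per block, keep support vectors. Reduce: retrain on their union."""
    pooled = []
    for block in blocks:
        w = train_linear_svm(block)
        pooled.extend(support_vectors(block, w))
    # fall back to all points if no block yielded margin points
    pooled = pooled or [p for block in blocks for p in block]
    return train_linear_svm(pooled)

# Toy linearly separable data, split into two "HDFS blocks".
random.seed(0)
pos = [([random.uniform(1, 2), random.uniform(1, 2)], 1) for _ in range(40)]
neg = [([random.uniform(-2, -1), random.uniform(-2, -1)], -1) for _ in range(40)]
data = pos + neg
random.shuffle(data)
blocks = [data[:40], data[40:]]

w = cascade_svm(blocks)
acc = sum(1 for x, y in data
          if (sum(wi * xi for wi, xi in zip(w, x)) > 0) == (y > 0)) / len(data)
print(f"training accuracy: {acc:.2f}")
```

On a real cluster, each call to train_linear_svm would run as a map task on one DataNode's block, and only the (much smaller) support-vector sets would travel over the network to the reduce step.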
In this paper, the kernel function type, the kernel parameters, and the penalty factor are optimized jointly with a genetic algorithm. The experimental results show that, compared with a traditional SVM whose parameters are set from empirical values, the SVM whose parameters are optimized by the genetic algorithm achieves significantly better prediction accuracy. A series of experiments on UCI standard data sets analyzed the feasibility and performance of the proposed algorithm in terms of training time, prediction accuracy, and other aspects. The results show that, compared with the traditional SVM, the parallel SVM noticeably reduces training time, with no significant decrease in prediction accuracy. The paper also uses the speedup ratio (serial training time divided by parallel training time) to analyze the relationship between the number of nodes and the training time required by the parallel algorithm; the experimental results show that the speedup ratio rises as the number of nodes in the cluster increases.
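The genetic-algorithm search over SVM parameters can be sketched as below. This is a hedged illustration only: the fitness function here is a smooth stand-in for what would, in the thesis's setting, be cross-validated SVM prediction accuracy, and the chromosome encoding, ranges, and operator choices are assumptions, not the thesis's actual design. The search evolves a population of (C, gamma) pairs by selection, crossover, and mutation.

```python
import math
import random

def fitness(C, gamma):
    # Stand-in surrogate for cross-validated accuracy; peaks at
    # C = 10, gamma = 0.1 (purely illustrative values).
    return math.exp(-((math.log10(C) - 1) ** 2 + (math.log10(gamma) + 1) ** 2))

def random_individual():
    # log-uniform sampling over typical SVM hyperparameter ranges
    return (10 ** random.uniform(-2, 3), 10 ** random.uniform(-4, 1))

def crossover(a, b):
    # uniform crossover: each gene comes from one parent
    return (random.choice((a[0], b[0])), random.choice((a[1], b[1])))

def mutate(ind, scale=0.3):
    # multiplicative mutation in log space keeps parameters positive
    C, g = ind
    return (C * 10 ** random.gauss(0, scale), g * 10 ** random.gauss(0, scale))

def ga_search(pop_size=20, generations=30):
    random.seed(1)
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(*ind), reverse=True)
        elite = pop[: pop_size // 2]          # truncation selection
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=lambda ind: fitness(*ind))

best_C, best_gamma = ga_search()
print(f"best C={best_C:.3g}, gamma={best_gamma:.3g}")
```

In a real run, each fitness evaluation would train and validate an SVM, so the GA's population size and generation count trade search quality against total training cost.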
Keywords/Search Tags: Hadoop, Massive Data Mining, Genetic Algorithm, Support Vector Machine (SVM)