
Design And Implementation Of Parallel SVM Algorithm For Large Scale Text Data

Posted on: 2014-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
GTID: 2298330422969049
Subject: Software engineering
Abstract/Summary:
The support vector machine (SVM) is a classification algorithm from data mining that is grounded in statistical learning theory. Because it is less prone to over-fitting and is comparatively insensitive to the curse of dimensionality that arises when the number of features is very large, it is widely used in text classification, image recognition, and pattern recognition. On massive data sets, however, SVM training is slow: the trained model and its results cannot be obtained in a short time, so the SVM algorithm cannot be applied directly to large-scale data processing. This thesis therefore improves SVM training in two ways, by improving how the SVM quadratic programming problem is solved and by applying distributed computing, so that training scales to massive data.
First, the thesis applies the method of feasible directions to the SVM quadratic programming problem, yielding a new method that is simpler to compute and performs better. The new method uses a "coefficient adaptation method" in place of the original quadratic programming procedure; in addition, the process that determines the step-size coefficient in the original method is reduced to solving a quadratic equation in one unknown. Together, these two changes simplify the computation and reduce its complexity.
Second, the new SVM algorithm is parallelized on the Hadoop framework using the MapReduce model, and HBase is used as the distributed store for the data and the computed results. Combining parallel computing with distributed storage in the SVM training process greatly improves the ability of SVM to process massive data efficiently.
Finally, a text classification system for large-scale data sets is implemented on the basis of these two improvements. On a distributed cluster of 8 ordinary PCs, and with the same data size, performance improves by a factor of 4 to 5 relative to the serial SVM training process. These results demonstrate the advantages of parallel SVM training in classification performance, classification speed, and the scale of data it can handle.
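As a rough illustration of how SVM training can be split into map and reduce steps, the following Python sketch trains a linear SVM by having each data partition compute a partial hinge-loss subgradient (map) and then summing the partial results (reduce). This is not the thesis's feasible-direction method and it does not use Hadoop or HBase: the subgradient-descent solver, the function names (`map_partition`, `reduce_gradients`, `train`, `predict`), the parameters (`C`, `lr`, `epochs`), and the toy data are all assumptions chosen only to show the data-parallel pattern in a single process.

```python
from functools import reduce


def map_partition(w, b, partition, C):
    """Map step (illustrative): hinge-loss subgradient of one data partition."""
    gw = [0.0] * len(w)
    gb = 0.0
    for x, y in partition:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1.0:  # only margin violators contribute
            for j, xj in enumerate(x):
                gw[j] -= C * y * xj
            gb -= C * y
    return gw, gb


def reduce_gradients(left, right):
    """Reduce step (illustrative): sum two partial subgradients."""
    return [u + v for u, v in zip(left[0], right[0])], left[1] + right[1]


def train(partitions, dim, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum(hinge loss).

    A stand-in solver, assumed here for brevity; the thesis instead
    solves the SVM quadratic program with a feasible-direction method.
    """
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # On a real cluster each map call would run on a different node.
        partials = [map_partition(w, b, p, C) for p in partitions]
        gw, gb = reduce(reduce_gradients, partials)
        w = [wi - lr * (wi + gi) for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b


def predict(w, b, x):
    """Classify x as +1 or -1 by the sign of the decision function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, hypothetical data: two partitions of (features, label) pairs.
partitions = [
    [([2.0, 1.0], 1), ([1.5, 2.0], 1)],
    [([-1.0, -1.5], -1), ([-2.0, -1.0], -1)],
]
w, b = train(partitions, dim=2)
```

Because the subgradient of the hinge loss is a sum over training examples, the partial sums can be computed independently per partition and combined in any order, which is the property that makes this kind of training fit the MapReduce model.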
Keywords/Search Tags: Support Vector Machine, Quadratic Programming, Feasible Direction Method, Text Classification, Hadoop, HBase