
Distributed SVM Algorithm With K-means

Posted on: 2016-06-15  Degree: Master  Type: Thesis
Country: China  Candidate: R Wang  Full Text: PDF
GTID: 2348330488474312  Subject: Computer system architecture
Abstract/Summary:
With the rapid development of network technology, the volume of data generated on networks has grown at an astonishing rate, and the structural complexity of that data keeps increasing. How to extract useful information from such a sea of complex data is currently a hot research topic. The Support Vector Machine (SVM) is a well-known supervised classification method with high predictive accuracy, resistance to overfitting, few parameters to tune, and other advantages, which makes it a preferred method for classification problems. However, when the training data set grows so large that its storage and processing requirements exceed the capacity of a single machine, the large memory footprint and long training time of a traditional stand-alone SVM limit its application to big data. Studies have found that cluster-based distributed processing can effectively shorten training time and relieve the memory burden, so studying SVM algorithms for distributed parallel computing is particularly important.

Currently, most distributed parallel SVM algorithms with relatively high prediction accuracy are implemented with a multi-layer iterative full-feedback mechanism: non-support vectors are discarded layer by layer while support vectors (SVs) are retained, and the SVs preserved at the final layer form a globally optimal solution over the original training set. These algorithms use random partitioning to generate the sub-sample sets that are trained in parallel. Experimental tests show that generating sub-sample sets by random partitioning has two deficiencies: first, the distribution of a sub-sample set is likely to deviate from that of the original data set; second, in a concurrent environment, the sub-sample sets used in each training run may vary. As a result, the overall training model is unreliable, the final prediction accuracy is low, and results jitter noticeably across repeated training runs.
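The distribution-bias problem described above can be illustrated with a small sketch (the toy data set, partition count, and function names here are illustrative assumptions, not taken from the thesis): randomly partitioning an imbalanced labelled set can leave individual sub-sample sets with class proportions that deviate from the original set.

```python
import random

def positive_ratio(samples):
    """Fraction of samples whose label is 1."""
    return sum(1 for _, y in samples if y == 1) / len(samples)

random.seed(7)

# Imbalanced toy set: 10 positives among 100 samples, as (feature, label) pairs
data = [(i, 1) for i in range(10)] + [(i, 0) for i in range(90)]
random.shuffle(data)

# Random partition into 5 equal sub-sample sets, as random-partition schemes do
k = 5
subsets = [data[i::k] for i in range(k)]

overall = positive_ratio(data)               # 0.10 by construction
per_subset = [positive_ratio(s) for s in subsets]
bias = max(abs(r - overall) for r in per_subset)
print(f"overall ratio {overall:.2f}, per-subset ratios {per_subset}, max bias {bias:.2f}")
```

Because the shuffle, not the data distribution, decides which samples land in each subset, the per-subset ratios change from run to run, which is exactly the jitter across repeated training runs that the thesis points out.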
To this end, this thesis proposes a parallel SVM optimization algorithm that generates the sub-sample sets with k-means clustering: the unsupervised k-means algorithm divides the original data set, thereby effectively avoiding the problems of random partitioning. The parallel SVM optimization algorithm is deployed on the popular distributed computing platform Hadoop and tested on experimental data sets. Experimental results show that, compared with random partitioning as the number of partitions increases, the proposed optimization algorithm not only effectively reduces the distribution deviation between the generated sub-sample sets and the original data set, but also reduces the jitter of the overall training model. As a result, the algorithm is robust and has better generalization ability.
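A minimal sketch of the proposed partitioning step (pure Python; the deterministic farthest-point initialisation is an assumption made here for reproducibility, and the thesis itself runs the training on Hadoop): each k-means cluster becomes one sub-sample set that would feed one local SVM trainer.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_partition(points, k, iters=100):
    """Partition points into k cluster-based sub-sample sets via Lloyd's algorithm.

    Farthest-point seeding is used here only so the sketch is deterministic;
    any standard k-means initialisation would serve the same purpose.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        # Assign every point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        # Recompute each center as its cluster mean (keep old center if empty)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return groups                    # each group is one sub-sample set

# Toy data: two well-separated blobs; each becomes one sub-sample set
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
subsets = kmeans_partition(data, k=2)
```

Because each subset is a cluster, its points come from one coherent region of the original distribution rather than an arbitrary shuffle, which is what keeps the sub-sample sets stable across repeated runs.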
Keywords/Search Tags: Support Vector Machine, Clustering, Distributed Computing, Random Partition, Hadoop