
SVM Algorithm Optimization and Application in Text Classification Based on Hadoop

Posted on: 2016-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: H Tao
Full Text: PDF
GTID: 2298330467993054
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of Internet technology, every industry generates vast amounts of data, and extracting meaningful information from these data has become a serious problem. For large data sets, the processing capacity of traditional data mining algorithms is limited.

The support vector machine (SVM) algorithm rarely overfits. For linearly inseparable data sets or data sets with high-dimensional feature vectors, its accuracy is relatively high, so SVM is well suited to text data sets. For large data sets, however, the SVM algorithm has high computational complexity and a relatively long running time.

Building on the popular cloud computing platform Hadoop, this paper proposes a parallel cascade support vector machine (PCSVM) based on Hadoop MapReduce. The algorithm uses a method similar to the cascade support vector machine (CSVM). Under the MapReduce model, the training data are divided into multiple sub-training sets. These subsets are trained separately and cascaded hierarchically to obtain the support vectors and, from them, a classification model. Differences in the distribution of training samples across subsets may affect the classification results during parallel training, so the algorithm also uses feedback to optimize the resulting classifier. Experimental results show that the Hadoop-based PCSVM algorithm effectively reduces training time and improves classification speed while maintaining high accuracy.

Apache Spark is a lightweight, fast cloud computing platform that does not need to read from and write to the Hadoop Distributed File System (HDFS) repeatedly, which makes it more suitable than Hadoop for iterative algorithms. Therefore, this paper also proposes a parallel support vector machine method on Spark, which uses a budgeted mini-batch parallel gradient descent (BMBPGD) algorithm.
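To make the cascade scheme behind PCSVM more concrete, the following is a minimal single-machine sketch in Python, not the thesis implementation. It assumes a linear SVM trained by dual coordinate descent as a stand-in for the real solver, runs the per-subset training calls sequentially where Hadoop would run them as map tasks, and omits the feedback step. All function names (`train_svm`, `cascade_svm`, `make_blobs`, `predict`) and the toy data are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_svm(samples, C=1.0, epochs=50):
    """Linear soft-margin SVM via dual coordinate descent (stand-in solver).

    samples: list of (features, label) with list features and label in {-1, +1}.
    Returns (w, support_vectors); the bias is folded in as a constant feature.
    """
    xs = [x + [1.0] for x, _ in samples]
    ys = [y for _, y in samples]
    alpha = [0.0] * len(samples)
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(xs, ys)):
            grad = y * dot(w, x) - 1.0
            new_alpha = min(max(alpha[i] - grad / dot(x, x), 0.0), C)
            delta = new_alpha - alpha[i]
            if delta:
                w = [wj + delta * y * xj for wj, xj in zip(w, x)]
                alpha[i] = new_alpha
    # Support vectors are the samples with a non-zero dual coefficient.
    svs = [s for s, a in zip(samples, alpha) if a > 0.0]
    return w, svs


def cascade_svm(samples, n_parts=4):
    """Train on subsets, then merge support vectors pairwise and retrain."""
    layer = [samples[i::n_parts] for i in range(n_parts)]
    while True:
        # Each call is independent, so it could run as a separate map task.
        results = [train_svm(part) for part in layer]
        if len(results) == 1:
            return results[0]
        sv_sets = [svs for _, svs in results]
        # Merge neighbouring support-vector sets to form the next layer.
        layer = [sum(sv_sets[i:i + 2], []) for i in range(0, len(sv_sets), 2)]


def predict(w, x):
    return 1 if dot(w, x + [1.0]) >= 0 else -1


def make_blobs(n_per_class, seed=0):
    """Hypothetical toy data: two well-separated 2-D Gaussian clusters."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)], label))
    rng.shuffle(data)
    return data
```

A call such as `w, svs = cascade_svm(make_blobs(100), n_parts=4)` trains four sub-models, then two, then one. Only support vectors are passed between layers, which is what keeps the later training sets small.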
BMBPGD uses a removal-based budget maintenance method to bound the number of support vectors (SVs), so each update has constant space and time complexity. Experimental results show that BMBPGD achieves higher accuracy than the SVMWithSGD algorithm in MLlib on Spark, and that it takes much less time than LibSVM.

This article also describes Big Cloud-Parallel Data Mining (BC-PDM), a system developed in our laboratory, into which the parallel support vector machine algorithm is integrated. It explains how the algorithm is integrated and illustrates the detailed process of classifying text data with the parallel support vector machine module in the BC-PDM system.
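The idea of a budgeted mini-batch update with removal can be illustrated with a short single-machine Python sketch. It is only an approximation under stated assumptions, not the BMBPGD algorithm from the thesis: it assumes a Pegasos-style step size, an RBF kernel, and a removal rule that drops the support vector with the smallest absolute coefficient, none of which the abstract specifies. The names `train_budgeted`, `decision`, `predict`, and `make_blobs` are hypothetical, and the per-example margin checks that Spark would distribute across partitions run here in a plain loop.

```python
import math
import random


def rbf(a, b, gamma=0.5):
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def decision(model, data, x):
    """model maps a training-sample index to its signed coefficient."""
    return sum(coef * rbf(data[i][0], x) for i, coef in model.items())


def train_budgeted(data, budget=20, batch_size=8, lam=0.01, steps=200, seed=0):
    """Kernel SVM by mini-batch subgradient descent with a support-vector budget."""
    rng = random.Random(seed)
    model = {}
    for t in range(1, steps + 1):
        eta = 1.0 / (lam * t)  # assumed Pegasos-style step size
        batch = rng.sample(range(len(data)), batch_size)
        # Margin checks are independent per example (parallelisable step).
        violators = [i for i in batch
                     if data[i][1] * decision(model, data, data[i][0]) < 1.0]
        # Regularisation shrinks every existing coefficient.
        for i in model:
            model[i] *= 1.0 - eta * lam
        # Margin violators are added, or reinforced if already present.
        for i in violators:
            model[i] = model.get(i, 0.0) + eta * data[i][1] / batch_size
        # Budget maintenance by removal: drop the weakest support vectors.
        while len(model) > budget:
            weakest = min(model, key=lambda i: abs(model[i]))
            del model[weakest]
    return model


def predict(model, data, x):
    return 1 if decision(model, data, x) >= 0 else -1


def make_blobs(n_per_class, seed=0):
    """Hypothetical toy data: two well-separated 2-D Gaussian clusters."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)], label))
    rng.shuffle(data)
    return data
```

Because the model never holds more than `budget` support vectors, evaluating one example costs at most `budget` kernel computations regardless of how many training steps have run. This is the property the abstract refers to as constant space and time per update.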
Keywords/Search Tags: support vector machines, Hadoop, Spark, parallel computing, gradient descent