SVM Algorithm Optimization And Application In Text Classification Based On Hadoop

Posted on:2016-07-16

Degree:Master

Type:Thesis

Country:China

Candidate:H Tao

Full Text:PDF

GTID:2298330467993054

Subject:Computer Science and Technology

Abstract/Summary:

With the continuous development of Internet technology, every industry generates vast amounts of data, and extracting meaningful information from these data has become a serious problem. For large data sets, the processing capacity of traditional data mining algorithms is limited.

The support vector machine (SVM) algorithm rarely overfits. For linearly inseparable data sets or data sets with high-dimensional feature vectors, its accuracy is relatively high, so SVM is well suited to text data sets. For large data sets, however, the SVM algorithm has high computational complexity and a relatively long running time.

Building on the popular cloud computing platform Hadoop, this thesis proposes a parallel cascade support vector machine (PCSVM) based on Hadoop MapReduce. The algorithm uses a method similar to the cascade support vector machine (CSVM). Under the MapReduce model, the training data are divided into multiple sub-training sets, which are trained separately and cascaded hierarchically to obtain the support vectors and, from them, a classification model. Because differences in the distribution of training samples across the subsets may affect the classification results during parallel training, the algorithm also uses feedback to optimize the resulting classifier. Experimental results show that the Hadoop-based PCSVM algorithm effectively reduces training time and improves classification speed while maintaining a high accuracy rate.

Apache Spark is a lightweight, fast cloud computing platform that does not require repeated reads from and writes to the Hadoop Distributed File System, which makes it more suitable than Hadoop for iterative algorithms. This thesis therefore proposes a parallel support vector machine method on Spark that uses a budgeted mini-batch parallel gradient descent (BMBPGD) algorithm. BMBPGD uses the removal budget maintenance method to keep the number of support vectors (SVs) bounded, so it has constant space and time complexity per update. Experimental results show that BMBPGD achieves higher accuracy than the SVMWithSGD algorithm in MLlib in the Spark environment and takes much less time than LibSVM.

This thesis also describes Big Cloud-Parallel Data Mining (BC-PDM), a system developed in our laboratory, into which the parallel support vector machine algorithm is integrated. It explains how the algorithm is integrated and illustrates the detailed process of classifying text data with the parallel support vector machine module in the BC-PDM system.
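
The abstract does not give the BMBPGD update rule, so the following is only a minimal single-machine sketch of the general idea of budgeted mini-batch gradient descent with removal-based budget maintenance. It is not the thesis's implementation and involves no Spark parallelism. The function names (`rbf`, `decision`, `predict`, `train_budgeted`), the RBF kernel, the Pegasos-style step size 1/(lam * t), and all default parameter values are assumptions made for illustration.

```python
import math
import random


def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def decision(model, x, gamma=0.5):
    """Kernel expansion f(x) = sum_i alpha_i * k(sv_i, x)."""
    return sum(alpha * rbf(sv, x, gamma) for sv, alpha in model.items())


def predict(model, x, gamma=0.5):
    """Predicted label in {-1, +1} from the sign of the decision value."""
    return 1 if decision(model, x, gamma) >= 0.0 else -1


def train_budgeted(data, budget=20, lam=0.01, batch_size=4, epochs=3,
                   gamma=0.5, seed=0):
    """Hypothetical budgeted mini-batch kernel SVM trainer (illustration only).

    data: iterable of (feature_vector, label) pairs with label in {-1, +1}.
    Returns a dict mapping support vector (tuple) -> coefficient alpha.
    """
    rng = random.Random(seed)
    samples = [(tuple(x), y) for x, y in data]
    model = {}
    t = 0
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            t += 1
            eta = 1.0 / (lam * t)  # assumed Pegasos-style step size

            # Hinge-loss violators, evaluated against the current model.
            violators = [(x, y) for x, y in batch
                         if y * decision(model, x, gamma) < 1.0]

            # Regularization step: shrink coefficients by (1 - eta * lam).
            shrink = 1.0 - 1.0 / t
            for sv in model:
                model[sv] *= shrink

            # Gradient step: each violator becomes (or reinforces) an SV.
            for x, y in violators:
                model[x] = model.get(x, 0.0) + eta * y / len(batch)

            # Removal budget maintenance: drop the support vectors with the
            # smallest |alpha| until at most `budget` remain.
            while len(model) > budget:
                weakest = min(model, key=lambda sv: abs(model[sv]))
                del model[weakest]
    return model
```

Because the model never holds more than `budget` support vectors after an update, each update evaluates at most `batch_size * budget` kernel values, which is what keeps the per-update space and time cost constant for fixed settings. Removal is the simplest budget maintenance strategy; it discards information, so accuracy can depend noticeably on the chosen budget.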

Keywords/Search Tags:

support vector machines, Hadoop, Spark, parallel computing, gradient descent

Related items

1	Imbalanced Stochastic Gradient Descent Online Algorithm For SVM
2	A Study On Large Scale Nonlinear Support Vector Machines
3	Optimization And Application Of SVM Algorithm Based On Spark
4	A Study On The Fast Training Methods Of Support Vector Machines Based On Coordinate Descent
5	Methodologies And Applications For Solving Large-scale Support Vector Machines
6	Spark-based SVM Algorithm Optimization And Application In Text Classification
7	A Fast Optimization Algorithm For Support Vector Machines
8	Research On Support Vector Machine Based On Improved Loss Function
9	Based On LSSVR Improved RBF Neural Network Algorithm And Its Application
10	Multi-view Generalized Eigenvalue Proximal Support Vector Machines