Spark-based SVM Algorithm Optimization And Application In Text Classification

Posted on: 2017-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Ma
Full Text: PDF
GTID: 2428330566453060
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, Internet applications are becoming increasingly widespread. Every day, huge amounts of data from business, medicine, social, scientific and other areas flow through computer networks into data storage devices. Although information is becoming more and more diversified, text remains the main carrier of information. Extracting valuable information from massive text data has therefore become an urgent problem, and classification is an important task in text data mining. The Support Vector Machine (SVM), which is based on Vapnik-Chervonenkis dimension theory and the structural risk minimization principle of statistical learning theory, performs well on classification problems characterized by small sample sizes, high dimensionality and nonlinearity, and it rarely overfits. The SVM is therefore well suited to text classification. However, because of its high computational complexity, the SVM's training time increases sharply when the training data set is large. This thesis therefore adopts Spark, a distributed computing framework whose Resilient Distributed Datasets (RDD) give it superior performance on iterative workloads. The Spark Machine Learning Library (MLlib) includes several common machine learning algorithms, among which the SVM uses Stochastic Gradient Descent (SGD) as its underlying optimization algorithm. To address SGD's tendency to converge to a local optimum, this thesis proposes a new optimization algorithm, Simulated Annealing-Stochastic Gradient Descent (SA-SGD), and an SVM algorithm optimized by it, Support Vector Machine with Simulated Annealing Stochastic Gradient Descent (SVMWithSASGD). Experimental results show that, in a Spark cluster environment, the average accuracy rate and regression of the SVMWithSASGD algorithm are both higher than those of the SVMWithSGD algorithm in MLlib and the classical LibSVM algorithm. This indicates that, compared with SVMWithSGD, the improved algorithm is more likely to escape from a local optimum and converge to the global optimum. Finally, this thesis integrates the SVMWithSASGD algorithm module and applies it to multi-class text classification on the Spark platform. It presents the detailed design and implementation of text preprocessing, feature selection and extraction, and the other stages of text categorization, as well as the design of a multi-class classification model based on the selected text corpus. The system test verifies the feasibility of the improved SVMWithSASGD algorithm for multi-class text classification. Comparative experiments against Spark MLlib's SVMWithSGD and its improved variant, Budgeted Mini-Batch Parallel Gradient Descent (BMBPGD), confirm the advantage of SVMWithSASGD.
Keywords/Search Tags: support vector machine, stochastic gradient descent, simulated annealing, Spark, text classification
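
The abstract does not give the details of SA-SGD, so the following is only a minimal single-machine sketch of the general idea it describes: combining mini-batch SGD on a linear SVM objective with a simulated-annealing acceptance rule. Everything in it is an assumption made for illustration, including the function names (`hinge_objective`, `sgd_step`, `sa_sgd`), the Gaussian perturbation, the geometric cooling schedule and all parameter values. It does not use Spark or MLlib and should not be read as the thesis's SVMWithSASGD implementation.

```python
import math
import random


def hinge_objective(w, data, reg):
    """Mean hinge loss plus an L2 penalty for a linear SVM (no bias term)."""
    loss = sum(
        max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))
        for x, y in data
    ) / len(data)
    return loss + 0.5 * reg * sum(wi * wi for wi in w)


def sgd_step(w, batch, reg, lr):
    """One mini-batch subgradient step on the hinge objective."""
    grad = [reg * wi for wi in w]
    for x, y in batch:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:  # only margin violators contribute to the subgradient
            for j, xj in enumerate(x):
                grad[j] -= y * xj / len(batch)
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def sa_sgd(data, dim, steps=200, batch_size=8, reg=0.01, lr=0.1,
           t0=1.0, cooling=0.95, seed=0):
    """Hypothetical SA-SGD loop: SGD proposals filtered by a Metropolis test."""
    rng = random.Random(seed)
    w = [0.0] * dim
    current = hinge_objective(w, data, reg)
    best_w, best = w, current
    temp = t0
    for _ in range(steps):
        batch = rng.sample(data, min(batch_size, len(data)))
        # Candidate = SGD step plus noise that shrinks as the temperature drops
        # (the noise scale 0.1 * temp is an arbitrary illustrative choice).
        cand = [c + rng.gauss(0.0, 0.1 * temp)
                for c in sgd_step(w, batch, reg, lr)]
        cand_obj = hinge_objective(cand, data, reg)
        delta = cand_obj - current
        # Metropolis criterion: always accept improvements, and accept a worse
        # candidate with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            w, current = cand, cand_obj
            if current < best:
                best_w, best = w, current
        temp *= cooling  # geometric cooling schedule
    return best_w, best


# Toy, linearly separable data: (features, label in {-1, +1}).
toy = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
weights, objective = sa_sgd(toy, dim=2)
```

Accepting an occasional worse candidate at high temperature is what gives the search a chance to leave a poor region, while the cooling schedule makes it behave more and more like plain SGD. The sketch evaluates the objective on the full data set at every step for simplicity; a distributed version would need a cheaper acceptance test.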