Research And Optimization On Semiparametric Support Vector Machine Under Spark Framework

Posted on: 2020-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Wang
Full Text: PDF
GTID: 2428330590959391
Subject: Computer software and theory
Abstract/Summary:
The rapid development of big data technology has driven continuous improvement in techniques for analyzing and processing massive data. Machine learning algorithms that perform well on small samples are gradually being applied to big data learning scenarios. The semi-parametric support vector machine (S-SVM) is a computational model that combines the advantages of parametric and nonparametric models: it can control the complexity of the classifier and trains efficiently. On big data, however, its computation time becomes relatively long. This thesis studies the S-SVM algorithm in a big data environment and uses the Spark computing framework to parallelize and improve it.

The S-SVM studied here uses the Sparse Greedy Matrix Approximation (SGMA) algorithm to construct the predefined model and the Iterative Reweighted Least Squares (IRWLS) procedure to calculate the weights. To address the long running time on big data, two successive optimizations of the algorithm's computational efficiency are proposed:
(1) A parallel S-SVM on Spark. It uses Spark RDDs to keep shared data in memory, which reduces the volume of data sent over the network and the number of disk I/O operations, and it uses Cholesky matrix decomposition to break the computation into a series of sub-tasks that can be executed in parallel.
(2) On the basis of the parallel S-SVM, a combination of the k-means and SGMA algorithms to construct the predefined model. The cluster centers produced by k-means are used to compute the kernel matrix in SGMA, which improves efficiency by reducing the size of the matrix and the amount of computation.

Experiments show that the Spark-based parallel S-SVM has higher computational efficiency than the original single-machine algorithm and almost the same classification performance. The improved parallel S-SVM matches the original parallel version in classification accuracy and AUC while running in less time. The number of cluster centers, which is the new parameter introduced by the improved algorithm, has little influence on its performance. Finally, comparison with the BPPGD, P-PackSVM and SVMwithSGD algorithms shows that the final optimized algorithm is superior overall in classification accuracy, classifier AUC, training time and classification time.
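The idea behind optimization (2) can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: there is no Spark, SGMA or IRWLS here. A Gaussian (RBF) kernel is assumed, initial centers are chosen deterministically, and one regularized least-squares solve stands in for a single reweighting step. All function names (`rbf`, `kmeans`, `cholesky`, `cholesky_solve`, `fit_reduced`, `predict`) and the parameters `gamma` and `lam` are hypothetical. What the sketch does show is that using k cluster centers as the kernel basis shrinks the linear system from n x n to k x k, and that the system can then be solved through a Cholesky factorization.

```python
import math

def rbf(x, z, gamma=0.5):
    """Gaussian (RBF) kernel between two points (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Initial centers are evenly spaced samples,
    a deterministic simplification of the usual random seeding."""
    centers = [list(p) for p in points[::len(points) // k][:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def cholesky_solve(low, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][t] * y[t] for t in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[t][i] * x[t] for t in range(i + 1, n))) / low[i][i]
    return x

def fit_reduced(points, labels, k, lam=1e-3, gamma=0.5):
    """Fit weights on a k-center kernel basis with one ridge least-squares solve."""
    centers = kmeans(points, k)
    # n x k kernel block between samples and centers (instead of n x n)
    phi = [[rbf(p, c, gamma) for c in centers] for p in points]
    # k x k normal equations: (phi^T phi + lam * I) beta = phi^T y
    gram = [[sum(r[i] * r[j] for r in phi) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(phi, labels)) for i in range(k)]
    return centers, cholesky_solve(cholesky(gram), rhs)

def predict(x, centers, beta, gamma=0.5):
    """Classify x as +1 or -1 from the sign of the kernel expansion."""
    score = sum(b * rbf(x, c, gamma) for b, c in zip(beta, centers))
    return 1 if score >= 0 else -1
```

In a distributed setting, the per-sample contributions to `gram` and `rhs` are sums over rows, so they could in principle be accumulated partition by partition and combined before the small k x k system is factorized. Whether the thesis organizes its parallel computation this way is not stated in the abstract.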
Keywords/Search Tags:Spark, Semiparametric Support Vector Machine, kmeans, parallelization