
Research On Parallel Support Vector Machine Algorithm In Big Data Environment

Posted on: 2022-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: L X Zhang
GTID: 2518306524498574
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology and the arrival of the big data era, the continuous growth and accumulation of data puts every field under pressure to process massive data sets. How to extract meaningful information from large-scale data quickly and effectively is an important research topic. As a key data mining method, the support vector machine (SVM) has a sound theoretical foundation, strong generalization ability, and the ability to obtain a globally optimal solution. However, it is only suitable for small-scale data sets, and its computational cost becomes very high when processing big data. With the widespread application of distributed frameworks such as MapReduce, combining improved SVM algorithms with distributed computing architectures has become a research hotspot in the big data environment. In recent years, parallel SVM algorithms have achieved certain results in data mining under the big data environment. However, data in this environment is complex and noisy, and the SVM algorithm has its own limitations, which together lead to low execution efficiency and low classification accuracy for parallel SVM algorithms.

To improve the ability of the parallel SVM algorithm to process large-scale data, as well as its classification performance, this paper starts from three aspects. The first is to eliminate the interference of noisy data: a reasonable noise filtering strategy is designed to preprocess the original data and delete noisy data in the big data environment. The second is to start from the SVM algorithm itself: an information granulation method is used to filter class boundary samples, quickly reduce the size of the training set, and improve the execution efficiency of the algorithm. The third is to improve the stability of the parallel SVM model: feature similarity groups are constructed according to the diversity of features, and multiple base learners are trained to obtain a stable classification model, thereby improving the overall performance of the parallel SVM algorithm. The main research work on the two parallel SVM algorithms is as follows.

(1) A parallel SVM algorithm using granularity and information entropy. Aiming at the problems of noise sensitivity and training sample redundancy in parallel SVM algorithms in the big data environment, this paper proposes a parallel SVM algorithm using granularity and information entropy, named GIESVM-MR. Firstly, the algorithm uses the NC (noise cleaning) method to evaluate the importance of each feature attribute and obtain the correlation between samples and categories, which effectively identifies and deletes noisy data. Secondly, a GDC (Data Compression based on Granulation) strategy is proposed, which screens the information granules to retain class boundary samples and delete non-support vectors. This yields a smaller data set and solves the problem of training sample redundancy in the big data environment. Finally, the final classification model is generated by combining the idea of Bagging with the MapReduce computing model. Experimental results show that the GIESVM-MR algorithm not only effectively improves classification accuracy, but also reduces the time complexity of the parallel SVM algorithm in the big data environment.

(2) A parallel SVM algorithm using mutual information and the artificial fish swarm algorithm based on MapReduce. Aiming at the problems of noise sensitivity, difficult parameter selection, and large model jitter in parallel SVM algorithms in the big data environment, this paper proposes a parallel SVM algorithm using mutual information and the artificial fish swarm algorithm based on MapReduce, named MIAFSA-PSVM. Firstly, the algorithm uses the NMI (normalized mutual information) method to measure the correlation between features and categories and to remove irrelevant features from the data set, which effectively identifies and deletes noisy data. Secondly, an improved artificial fish swarm algorithm (IAFSA) is designed, which uses an adaptive visual range and step size (AVS) and an improved fitness function. Based on IAFSA, the optimal parameters and feature subset are selected, which overcomes the difficulty of parameter selection for SVM. Finally, multiple feature similarity groups are generated based on feature similarity, and multiple groups of base classifiers are trained in parallel to obtain a strong classifier by combining the MapReduce computing model with the idea of ensemble learning, which greatly reduces model jitter. The experimental results show that MIAFSA-PSVM not only effectively improves classification accuracy, but also reduces the time complexity of the parallel SVM algorithm in the big data environment.
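The abstract does not give the exact NMI formula used in MIAFSA-PSVM. As a rough illustration of the idea only, the sketch below scores discrete features against class labels with one common normalization, I(X;Y) / sqrt(H(X) H(Y)), and keeps the features whose score reaches a threshold. The function names, the choice of normalization, and the threshold value are all assumptions made for this example, not details from the thesis.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two aligned discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        mi += p * math.log2(p / ((px[x] / n) * (py[y] / n)))
    return mi


def normalized_mi(xs, ys):
    """NMI with geometric-mean normalization (an assumed choice)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0  # a constant feature carries no information about the class
    return mutual_information(xs, ys) / math.sqrt(hx * hy)


def select_features(columns, labels, threshold=0.1):
    """Return names of features whose NMI with the labels reaches the threshold.

    `columns` maps a feature name to its list of discrete values;
    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    return [name for name, col in columns.items()
            if normalized_mi(col, labels) >= threshold]


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    columns = {
        "informative": [0, 0, 1, 1],   # mirrors the labels exactly
        "irrelevant": [0, 1, 0, 1],    # independent of the labels
        "constant": [7, 7, 7, 7],      # zero entropy
    }
    print(select_features(columns, labels))
```

Continuous features would need to be discretized first; how the thesis handles them is not stated in the abstract.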
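Both algorithms build the final model by combining ensemble learning with the MapReduce computing model. A minimal sketch of that pattern, under stated assumptions, follows: a map step trains one base learner per bootstrap sample, and a reduce step combines the learners by majority vote. To keep the example dependency-free, the base learner is a nearest-centroid classifier standing in for an SVM, and the functions are plain Python rather than a real MapReduce job. All names here are hypothetical.

```python
import random
from collections import Counter


def train_centroid(samples):
    """Stand-in base learner (NOT an SVM): one mean vector per class."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x (squared distance)."""
    return min(model,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


def map_train(partition, seed):
    """Map step: draw a bootstrap sample from one data split and train on it."""
    rng = random.Random(seed)
    bootstrap = [rng.choice(partition) for _ in partition]
    return train_centroid(bootstrap)


def reduce_vote(models, x):
    """Reduce step: majority vote over the base learners' predictions."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Two well-separated toy classes; each "mapper" sees the same split here.
    data = ([([k * 0.1, 0.0], 0) for k in range(10)]
            + [([10.0 + k * 0.1, 10.0], 1) for k in range(10)])
    models = [map_train(data, seed) for seed in range(5)]
    print(reduce_vote(models, [0.2, 0.1]), reduce_vote(models, [10.3, 9.8]))
```

In an actual deployment each call to `map_train` would run on a separate mapper over its own data split, and the base learner would be an SVM trained on the compressed or feature-grouped data the abstract describes.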
Keywords/Search Tags:big data, noise, information entropy, support vector, feature similarity group