Research On Support Vector Machine For Large Scale Imbalanced Data

Posted on: 2017-01-04  Degree: Master  Type: Thesis
Country: China  Candidate: X N Dong  Full Text: PDF
GTID: 2428330569499063  Subject: Computer Science and Technology
Abstract/Summary:
In recent years, big data analysis technology has developed explosively. Data mining plays a positive role in advancing both academia and industry. Classification is of vital importance to data mining, and the support vector machine is an excellent classification algorithm. However, it gives low classification accuracy on imbalanced data. Furthermore, conventional classification algorithms suffer from long training times on massive data sets, which has encouraged researchers to study distributed classification algorithms. In this thesis, we explore support vector machines for large scale imbalanced data classification and conduct in-depth research on improving classification accuracy on imbalanced data and on decreasing training time. The main contributions of this thesis are as follows.

To address the poor accuracy of classification algorithms on imbalanced data, we propose an ensemble support vector machine based on boosting, boostingsvm, to improve the accuracy of the support vector machine on imbalanced data. The algorithm preprocesses the training data with a stratified under-sampling algorithm based on clustering, which we also present. In addition, we incorporate the idea of boosting into boostingsvm and optimize the update rule of boosting. Experimental results demonstrate that the stratified under-sampling algorithm based on k-means balances the data efficiently, that the sampled data represent the distribution of the original data, and that boostingsvm improves classification accuracy on imbalanced data.

Serial classification algorithms suffer from long training times when dealing with large scale imbalanced data. To address this problem, we propose a distributed baggingsvm algorithm based on a group training model. The algorithm uses an optimized cascade support vector machine to preprocess the data and splits the training data into pieces so that the classifiers can be trained in parallel. Experimental results show that our method significantly reduces training time at a slight cost in classification accuracy.
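
The abstract does not spell out the stratified under-sampling procedure, so the following is only a minimal sketch of the general idea under stated assumptions: the majority class is clustered with plain k-means, and each cluster contributes samples in proportion to its size, so that the reduced set roughly follows the original distribution. The function names `kmeans` and `stratified_undersample`, the proportional quota rule, and the parameters `n_keep`, `k`, and `seed` are illustrative choices, not taken from the thesis.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def stratified_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority points, drawn per cluster in
    proportion to cluster size (assumed quota rule, see lead-in)."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        if not members:
            continue
        # Every non-empty cluster keeps at least one representative.
        quota = max(1, round(n_keep * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept
```

Calling `stratified_undersample(majority_points, n_keep=len(minority_points))` on a list of feature tuples would give a majority subset of about the same size as the minority class, which could then be combined with the minority samples before training an SVM. Because of per-cluster rounding, the number of kept points can differ from `n_keep` by a few.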
Keywords/Search Tags: data mining, large scale imbalanced data, support vector machine, distributed machine learning