
Research On Classification Algorithm For Imbalanced Data Sets Based On Support Vector Machines

Posted on: 2012-09-06 | Degree: Master | Type: Thesis
Country: China | Candidate: S W Hao | Full Text: PDF
GTID: 2218330368982094 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of modern computer technology, research and all areas of social life have accumulated large amounts of data. Data mining techniques emerged and developed rapidly in order to convert these data into useful information and knowledge. One kind of data set, known as the imbalanced data set, has far more samples in one class than in another, and the information carried by the minority class is often the more important. The classification of imbalanced data sets has therefore become a hot research field in data mining.

The support vector machine (SVM) is a classification method built on statistical learning theory. It has a solid theoretical foundation and, on ordinary data sets, often outperforms other classification algorithms, but it does not classify imbalanced data sets well. This thesis starts from the characteristics of imbalanced data sets and proposes an under-sampling method based on clustering. After analyzing why SVM classification fails on imbalanced data sets, the proposed method under-samples the majority-class support vectors, removing part of the majority-class samples to reduce the degree of imbalance between the majority and minority classes. An SVM is then trained on the new sample set in order to improve classification accuracy.

Cost-sensitive learning is currently one of the popular approaches to classifying imbalanced data sets. The SVM itself, however, is not cost-sensitive, so it is not suited to cost-sensitive data mining. The thesis therefore proposes a cost-sensitive SVM based on data set decomposition. Using posterior probability outputs and a meta-learning process, it reconstructs a new sample set that integrates the misclassification costs, and then trains an SVM on the reconstructed training set so that classification minimizes the misclassification cost.

Simulation experiments were carried out for each algorithm using different evaluation criteria. The experimental results and their analysis show that the two algorithms achieve good results in improving accuracy and in minimizing misclassification cost, respectively.
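
The abstract does not give the details of the cost-sensitive reconstruction step, so the following is only a minimal sketch of one common way to do it, assuming a MetaCost-style rule: each training sample is relabeled with the class that minimizes its expected misclassification cost under the estimated posterior probabilities. The function names, the cost matrix, and the posterior values are all hypothetical illustrations, not taken from the thesis.

```python
from typing import List, Sequence


def expected_costs(posterior: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> List[float]:
    """Expected cost of predicting each class j for one sample.

    cost[i][j] is the cost of predicting class j when the true class is i,
    so the expected cost of predicting j is sum_i P(i | x) * cost[i][j].
    """
    n_classes = len(cost)
    return [
        sum(posterior[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]


def relabel_min_cost(posteriors: Sequence[Sequence[float]],
                     cost: Sequence[Sequence[float]]) -> List[int]:
    """Relabel every sample with the class of minimum expected cost.

    Ties are broken in favor of the lowest class index.
    """
    labels = []
    for p in posteriors:
        risks = expected_costs(p, cost)
        labels.append(min(range(len(risks)), key=risks.__getitem__))
    return labels


# Hypothetical setup: class 0 is the majority, class 1 the minority.
# Missing a minority sample (true 1, predicted 0) is assumed to cost
# five times as much as a false alarm (true 0, predicted 1).
COST = [[0.0, 1.0],
        [5.0, 0.0]]

# Hypothetical posterior estimates [P(0 | x), P(1 | x)], e.g. obtained
# from probability-calibrated SVM outputs.
POSTERIORS = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]

NEW_LABELS = relabel_min_cost(POSTERIORS, COST)
```

Under these assumed costs, a sample with only a 0.2 posterior for the minority class is already relabeled as minority, which is how the reconstructed training set shifts the decision boundary of an otherwise cost-insensitive SVM trained on it.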
Keywords/Search Tags: data mining, imbalanced data set, SVM, cost-sensitive