
The Research Of Imbalanced Data Classification Algorithm Based On Support Vector Machine

Posted on: 2015-11-06, Degree: Master, Type: Thesis
Country: China, Candidate: S F Hong
GTID: 2298330422988594, Subject: Computer technology
Abstract/Summary:
In the era of information explosion, the sheer volume of data has attracted wide attention, and people need to discover the regular patterns hidden in it and make full use of them. Classification is one of the most frequently encountered problems in data processing and has become an important research topic in machine learning. Compared with traditional classification methods, the support vector machine (SVM) has several merits: high generalization ability, absence of local minima, and suitability for high-dimensional, small-sample data. It can therefore better handle overfitting, the curse of dimensionality, and local minima, which is why this thesis focuses on SVM. The main idea of SVM is to map the training set into a high-dimensional space by means of a kernel function.
Previous studies have shown that SVM classifies balanced datasets well, but that it is difficult to obtain excellent results on imbalanced datasets. The main reason is that the classification hyperplane of SVM is determined by only a small number of support vectors. When an SVM classifier is applied to a class imbalance problem, its prediction is biased: the more samples a class has, the smaller its classification error, and vice versa.
To solve the problems above, the major research work of this thesis is to study how to address the class imbalance problem with SVM. The main research contents include the following two aspects:
First, we propose an SVM-based optimal decision threshold adjustment strategy (SVM-OTHR) and its ensemble version (EnSVM-OTHR) to handle the binary class imbalance problem. We expect the SVM-OTHR algorithm to help answer a puzzling question: how far should the classification hyperplane be moved? Specifically, the strategy is self-adapting and can find the optimal moving distance of the classification hyperplane according to the distribution of the training samples. Furthermore, we extend the strategy into an ensemble version (EnSVM-OTHR) that can further improve classification performance. Experimental results on 10 skewed data sets from the UCI data repository indicate the superiority of both methods.
Second, we propose an SVM-based ensemble learning algorithm to deal with high-dimensional, multiclass imbalanced data. The proposed algorithm first transforms the multiclass problem into multiple binary problems using the one-against-all coding strategy. Next, we introduce feature subspace, an evolved version of the random subspace method, which can generate multiple diverse training subsets. Then, we apply one of two correction techniques, namely decision threshold adjustment or random under-sampling, to each training subset to alleviate the damage caused by class imbalance. Finally, SVM is used as the base classifier, and a novel voting rule called counter voting is presented for making the final decision. Experimental results on eight skewed multiclass cancer microarray datasets indicate that the presented method is clearly superior to many traditional classification methods and can improve classification performance to a large extent.
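As an illustration of the decision threshold adjustment idea behind SVM-OTHR, the following minimal Python sketch moves the decision threshold of an already trained classifier. It is not the algorithm of the thesis: the abstract does not state the selection criterion, so the sketch assumes the G-mean (geometric mean of the two per-class recalls) measured on the training samples. The function names `g_mean` and `best_threshold` and the decision values in `scores` are hypothetical.

```python
import math

def g_mean(labels, preds):
    """Geometric mean of the recalls on the +1 (minority) and -1 (majority) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == -1 and p == -1)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        return 0.0
    return math.sqrt((tp / pos) * (tn / neg))

def best_threshold(scores, labels):
    """Scan candidate thresholds and return (threshold, G-mean) of the best one.

    Candidates are the midpoints between consecutive distinct scores,
    plus one value below the minimum and one above the maximum.
    Assumed criterion: highest G-mean on the given samples.
    """
    s = sorted(set(scores))
    candidates = [s[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(s, s[1:])]
    candidates += [s[-1] + 1.0]
    best_t, best_g = 0.0, -1.0
    for t in candidates:
        preds = [1 if v > t else -1 for v in scores]
        g = g_mean(labels, preds)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Hypothetical SVM decision values f(x), biased towards the majority class (-1):
# every sample, including the two minority samples, falls below the default threshold 0.
scores = [-2.0, -1.5, -1.2, -1.0, -0.9, -0.8, -0.4, -0.2]
labels = [-1, -1, -1, -1, -1, -1, 1, 1]

default_g = g_mean(labels, [1 if v > 0 else -1 for v in scores])
t, g = best_threshold(scores, labels)
print("G-mean at default threshold 0:", default_g)
print("adjusted threshold:", t, "G-mean:", g)
```

On these hypothetical scores the default threshold of 0 assigns every sample to the majority class, whereas the shifted threshold recovers both minority samples. In practice the scores would come from a trained SVM, and the threshold would be validated on held-out data.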
Keywords/Search Tags: Support Vector Machine (SVM), Class Imbalance Learning, Ensemble Learning, Classification, DNA Microarray Data