
Imbalanced Learning Based On Data-Partition And Sampling Technique

Posted on: 2020-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2428330575961129
Subject: Systems analysis and integration

Abstract/Summary:
The problem of class imbalance is an important research area in machine learning, characterized by a severely skewed distribution of examples across classes. In many practical applications, correctly identifying minority-class examples matters more than correctly identifying majority-class examples. In cancer detection, for instance, only a few examples correspond to cancer cases, and correctly identifying those cases is the more important task. Conventional classification methods such as k-nearest neighbors, C4.5, naive Bayes, and support vector machines typically aim for high overall accuracy, which often leads them to ignore or misclassify minority-class examples. To address this problem, this thesis proposes two imbalanced learning algorithms:

(1) Imbalanced learning based on data partition (ILDP). In the learning stage, ILDP partitions the majority-class set into several clusters and combines each cluster with the minority-class set to form several new training sets. A classifier is learned from each training set, so a classifier repository consisting of several classification models is constructed. In the prediction stage, for a given example to be classified, the algorithm uses the partition model built in the learning stage to select one model from the repository to predict the example.

(2) Imbalanced learning based on data partition and sampling (ILDPS). Like ILDP, ILDPS partitions the majority-class set into several clusters and combines each cluster with the minority-class set to obtain several new data sets. Unlike ILDP, ILDPS then applies a sampling technique to each data set to obtain the training set from which a classifier is learned. ILDPS therefore also constructs a classifier repository consisting of several classification models.

In both ILDP and ILDPS, the partition method serves two purposes: it fully exploits the local structure of the examples in the majority-class set, and it yields relatively balanced training sets. In ILDPS, the main role of the sampling technique is to further balance each training set so that each learned model generalizes better. Comprehensive experiments on the KEEL data sets show that both proposed methods improve the classification performance of the conventional classifiers; moreover, ILDPS further improves on ILDP and outperforms several existing methods on the evaluation measures of recall, g-mean, f-measure, and AUC.
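The partition-then-ensemble idea described above can be illustrated with a minimal Python sketch. This is not the thesis implementation: K-Means is used as the partition method (it appears in the keywords), a decision tree stands in for the base classifier, and simple random oversampling of the minority set to each cluster's size stands in for the ILDPS sampling step; all variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 200 majority examples (label 0) near the
# origin, 20 minority examples (label 1) near (3, 3).
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(3.0, 1.0, size=(20, 2))

# --- Learning stage ---
# Partition the majority-class set into k clusters (the "partition model").
k = 4
partition = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)

repository = []  # one classifier per majority cluster
for c in range(k):
    X_c = X_maj[partition.labels_ == c]
    # ILDPS-style sampling step: randomly oversample the minority set so
    # each training set is balanced against its majority cluster.
    idx = rng.integers(0, len(X_min), size=len(X_c))
    X_min_bal = X_min[idx]
    X_train = np.vstack([X_c, X_min_bal])
    y_train = np.hstack([np.zeros(len(X_c)), np.ones(len(X_min_bal))])
    repository.append(
        DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    )

# --- Prediction stage ---
# Route each query through the partition model to pick one classifier
# from the repository, then predict with that classifier only.
def predict(X):
    clusters = partition.predict(X)
    return np.array([
        repository[c].predict(x.reshape(1, -1))[0]
        for c, x in zip(clusters, X)
    ])

print(predict(np.array([[3.0, 3.0], [0.0, 0.0]])))
```

Routing each query to the classifier of its nearest majority cluster is what lets every base model train on a small, locally balanced subset rather than on the full skewed data.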
Keywords/Search Tags:class imbalance, data-partition, resampling, K-Means, hierarchical clustering