
Imbalanced Learning Based On Data-Partition And Sampling Technique

Posted on: 2020-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2428330575961129
Subject: Systems analysis and integration

Abstract/Summary:
The problem of class imbalance is an important research area in machine learning, characterized by a severely skewed distribution of examples across classes. In many practical applications, correctly identifying minority-class examples matters more than correctly identifying majority-class examples. In cancer detection, for instance, only a few examples correspond to cancer cases, and correctly identifying those cases is the more important task. Conventional classification methods such as k-nearest neighbors, C4.5, naive Bayes, and support vector machines typically aim for high overall accuracy, which often leads them to ignore or misclassify minority-class examples. To address this problem, this thesis proposes two imbalanced learning algorithms:

(1) Imbalanced learning based on data partition (ILDP). In the learning stage, ILDP partitions the majority-class set into several clusters and combines each cluster with the minority-class set to form several new training sets. A classifier is learned from each training set, so a classifier repository consisting of several classification models is constructed. In the prediction stage, for a given example to be classified, the algorithm uses the partition model built in the learning stage to select one model from the repository to predict the example.

(2) Imbalanced learning based on data partition and sampling (ILDPS). Like ILDP, ILDPS partitions the majority-class set into several clusters and combines each cluster with the minority-class set to obtain several new data sets. Unlike ILDP, ILDPS then applies a sampling technique to each data set to obtain the training set from which a classifier is learned. ILDPS therefore also constructs a classifier repository consisting of several classification models.

In both ILDP and ILDPS, the partition method serves two purposes: it fully exploits the local structure of the examples in the majority-class set, and it yields relatively balanced training sets. In ILDPS, the main role of the sampling technique is to further balance each training set so that each learned model generalizes better. Comprehensive experiments on the KEEL data sets show that both proposed methods improve the classification performance of the conventional classifiers; moreover, ILDPS further improves on ILDP and outperforms several existing methods on the evaluation measures of recall, g-mean, f-measure, and AUC.
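The partition-then-ensemble idea described above can be illustrated with a minimal Python sketch. This is not the thesis implementation: K-Means is used as the partition method (it appears in the keywords), a decision tree stands in for the base classifier, and simple random oversampling of the minority set to each cluster's size stands in for the ILDPS sampling step; all variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 200 majority examples (label 0) near the
# origin, 20 minority examples (label 1) near (3, 3).
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(3.0, 1.0, size=(20, 2))

# --- Learning stage ---
# Partition the majority-class set into k clusters (the "partition model").
k = 4
partition = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)

repository = []  # one classifier per majority cluster
for c in range(k):
    X_c = X_maj[partition.labels_ == c]
    # ILDPS-style sampling step: randomly oversample the minority set so
    # each training set is balanced against its majority cluster.
    idx = rng.integers(0, len(X_min), size=len(X_c))
    X_min_bal = X_min[idx]
    X_train = np.vstack([X_c, X_min_bal])
    y_train = np.hstack([np.zeros(len(X_c)), np.ones(len(X_min_bal))])
    repository.append(
        DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    )

# --- Prediction stage ---
# Route each query through the partition model to pick one classifier
# from the repository, then predict with that classifier only.
def predict(X):
    clusters = partition.predict(X)
    return np.array([
        repository[c].predict(x.reshape(1, -1))[0]
        for c, x in zip(clusters, X)
    ])

print(predict(np.array([[3.0, 3.0], [0.0, 0.0]])))
```

Routing each query to the classifier of its nearest majority cluster is what lets every base model train on a small, locally balanced subset rather than on the full skewed data.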
Keywords/Search Tags:class imbalance, data-partition, resampling, K-Means, hierarchical clustering