
Research On Methods For Classifying Imbalanced Data

Posted on: 2021-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: H. J. Ren
Full Text: PDF
GTID: 2428330602973787
Subject: Computer Science and Technology

Abstract/Summary:
Imbalanced data classification occurs in many real-world applications such as medical diagnosis, network intrusion detection, and biological data analysis. In practice, the more important information is often hidden in the minority class, and misclassifying minority-class samples can be very costly. Traditional machine learning classification algorithms, however, are poorly suited to class-imbalance tasks because they tend to achieve low accuracy on the minority class. It is therefore of great significance to develop new methods for classifying imbalanced data. This thesis proposes two methods that improve the classification of imbalanced data at the data level and the algorithm level. The main contributions are as follows:

(1) At the data level, an information granulation-based data preprocessing method for classifying imbalanced data, named IGDP, is proposed. Most existing resampling algorithms ignore the distribution of the original dataset. Considering the essential characteristics of imbalanced data, similar samples are aggregated into wholes so that the data can be analyzed, and a classification model built, from the perspective of granular computing. In IGDP, K-means++ is used to build information granules under the principle that the number of majority-class information granules should be similar to that of the minority class, which simplifies the determination of the granularity level. A "point marking sub-attribute" is proposed to describe information granules, resolving the interval-marking difficulty that the ordinary "sub-attribute" suffers in some cases. Test-data granulation based on preliminary class prediction is proposed to fit the distribution of the training information-granule set, so that granulated test samples share the characteristics of the majority- and minority-class information granules. Three groups of experiments were carried out on 8 KEEL datasets, and the algorithms were evaluated with three criteria: F-measure, G-mean, and AUC. The results show that IGDP can alleviate the classification difficulty caused by class overlapping and improve the classification performance on imbalanced data to a certain extent.

(2) At the algorithm level, two ensemble algorithms, CCBoost and CCBagging, are proposed; both incorporate Cluster Centroids (CC), a clustering-based undersampling method. Combining data resampling with ensemble learning is an effective way to tackle class imbalance. In CCBoost, the resampling randomness of CC ensures the diversity of the training subsets across AdaBoost.M2 iterations, and the "nearest neighbor of each cluster center" strategy is used for resampling, so that samples whose weights have just been updated have a greater chance of being selected for the next round of base-learner training. In CCBagging, the undersampling strategy of CC is chosen automatically according to dataset sparsity, since the parallel training of base learners in bagging need not consider sample weights. With CART and SVM as base learners, three groups of experiments were conducted on 10 KEEL datasets, again evaluated by F-measure, G-mean, and AUC. The experimental results show that CCBoost and CCBagging outperform the comparison algorithms to some extent.
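The class-wise granulation step described in (1) can be sketched as follows. This is a minimal illustration, not the thesis's IGDP algorithm: the function name `granulate`, the use of the minority sample count as the default granule number, the centroid-as-"point marking" representation, and the label convention (1 = minority, 0 = majority) are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def granulate(X, y, k=None, random_state=0):
    """Build class-wise information granules with k-means++ so that the
    majority class yields about as many granules as the minority class.
    Each granule is summarised here by its centroid (a point marking).

    Assumes y == 1 marks the minority class and y == 0 the majority class.
    """
    X_min, X_maj = X[y == 1], X[y == 0]
    # Illustrative default: one granule per minority sample.
    k = k or len(X_min)
    km_maj = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=random_state).fit(X_maj)
    km_min = KMeans(n_clusters=min(k, len(X_min)), init="k-means++", n_init=10,
                    random_state=random_state).fit(X_min)
    # The granulated, roughly balanced training set: centroids plus labels.
    Xg = np.vstack([km_maj.cluster_centers_, km_min.cluster_centers_])
    yg = np.hstack([np.zeros(km_maj.n_clusters), np.ones(km_min.n_clusters)])
    return Xg, yg
```

After granulation, an ordinary classifier can be trained on `(Xg, yg)`, since the granule counts of the two classes are now comparable.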
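The "nearest neighbor of each cluster center" undersampling used inside CCBoost can be illustrated in the same spirit. Again this is a sketch under assumptions: the function name `cc_nearest_undersample` is invented for the example, and the thesis's actual integration with AdaBoost.M2 weight updates is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def cc_nearest_undersample(X_maj, n_keep, random_state=0):
    """Cluster-centroids-style undersampling of the majority class: cluster
    into n_keep groups, then keep, for each cluster center, the single
    nearest *real* sample rather than the synthetic centroid itself."""
    km = KMeans(n_clusters=n_keep, init="k-means++", n_init=10,
                random_state=random_state).fit(X_maj)
    idx = [int(np.argmin(np.linalg.norm(X_maj - c, axis=1)))
           for c in km.cluster_centers_]
    # Duplicates are possible if two centers share a nearest sample,
    # so the kept set may be slightly smaller than n_keep.
    return X_maj[np.unique(idx)]
```

Because real samples (not synthetic centroids) are retained, samples whose boosting weights were just increased can still be drawn into the next training subset, which is the property the abstract highlights for CCBoost.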
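The three evaluation criteria used throughout (F-measure, G-mean, AUC) are standard for imbalanced classification; a compact way to compute them, assuming the minority class is labelled 1, is:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score, roc_auc_score

def gmean(y_true, y_pred):
    """G-mean: geometric mean of sensitivity (minority recall) and
    specificity (majority recall)."""
    sens = recall_score(y_true, y_pred, pos_label=1)
    spec = recall_score(y_true, y_pred, pos_label=0)
    return float(np.sqrt(sens * spec))

def evaluate(y_true, y_pred, y_score):
    """Return the three criteria used in the experiments."""
    return {
        "f_measure": f1_score(y_true, y_pred),
        "g_mean": gmean(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
```

Unlike plain accuracy, all three criteria penalise a classifier that ignores the minority class, which is why they are the appropriate yardsticks for the experiments on the KEEL datasets.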
Keywords/Search Tags:imbalance classification, data preprocessing, information granulation, ensemble learning, undersampling