
Research On Methods For Classifying Imbalanced Data

Posted on: 2021-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: H. J. Ren
Full Text: PDF
GTID: 2428330602973787
Subject: Computer Science and Technology

Abstract/Summary:
Imbalanced data classification occurs in many real-world applications such as medical diagnosis, network intrusion detection, and biological data analysis. In practice, the more important information is often hidden in the minority class, and misclassifying minority-class samples can be very costly. Traditional machine learning classification algorithms, however, are poorly suited to class-imbalance tasks because they tend to achieve low accuracy on the minority class. It is therefore of great significance to develop new methods for classifying imbalanced data. This thesis proposes two methods that improve the classification of imbalanced data at the data level and the algorithm level. The main contributions are as follows:

(1) At the data level, an information granulation-based data preprocessing method for classifying imbalanced data, named IGDP, is proposed. Most existing resampling algorithms ignore the distribution of the original dataset. Considering the essential characteristics of imbalanced data, similar samples are aggregated into wholes so that the data can be analyzed, and a classification model built, from the perspective of granular computing. In IGDP, K-means++ is used to build information granules under the principle that the number of majority-class information granules should be similar to that of the minority class, which simplifies the determination of the granularity level. A "point marking sub-attribute" is proposed to describe information granules, resolving the interval-marking difficulty that the ordinary "sub-attribute" suffers in some cases. Test-data granulation based on preliminary class prediction is proposed to fit the distribution of the training information-granule set, so that granulated test samples share the characteristics of the majority- and minority-class information granules. Three groups of experiments were carried out on 8 KEEL datasets, and the algorithms were evaluated with three criteria: F-measure, G-mean, and AUC. The results show that IGDP can alleviate the classification difficulty caused by class overlapping and improve the classification performance on imbalanced data to a certain extent.

(2) At the algorithm level, two ensemble algorithms, CCBoost and CCBagging, are proposed; both incorporate Cluster Centroids (CC), a clustering-based undersampling method. Combining data resampling with ensemble learning is an effective way to tackle class imbalance. In CCBoost, the resampling randomness of CC ensures the diversity of the training subsets across AdaBoost.M2 iterations, and the "nearest neighbor of each cluster center" strategy is used for resampling, so that samples whose weights have just been updated have a greater chance of being selected for the next round of base-learner training. In CCBagging, the undersampling strategy of CC is chosen automatically according to dataset sparsity, since the parallel training of base learners in bagging need not consider sample weights. With CART and SVM as base learners, three groups of experiments were conducted on 10 KEEL datasets, again evaluated by F-measure, G-mean, and AUC. The experimental results show that CCBoost and CCBagging outperform the comparison algorithms to some extent.
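The class-wise granulation step described in (1) can be sketched as follows. This is a minimal illustration, not the thesis's IGDP algorithm: the function name `granulate`, the use of the minority sample count as the default granule number, the centroid-as-"point marking" representation, and the label convention (1 = minority, 0 = majority) are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def granulate(X, y, k=None, random_state=0):
    """Build class-wise information granules with k-means++ so that the
    majority class yields about as many granules as the minority class.
    Each granule is summarised here by its centroid (a point marking).

    Assumes y == 1 marks the minority class and y == 0 the majority class.
    """
    X_min, X_maj = X[y == 1], X[y == 0]
    # Illustrative default: one granule per minority sample.
    k = k or len(X_min)
    km_maj = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=random_state).fit(X_maj)
    km_min = KMeans(n_clusters=min(k, len(X_min)), init="k-means++", n_init=10,
                    random_state=random_state).fit(X_min)
    # The granulated, roughly balanced training set: centroids plus labels.
    Xg = np.vstack([km_maj.cluster_centers_, km_min.cluster_centers_])
    yg = np.hstack([np.zeros(km_maj.n_clusters), np.ones(km_min.n_clusters)])
    return Xg, yg
```

After granulation, an ordinary classifier can be trained on `(Xg, yg)`, since the granule counts of the two classes are now comparable.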
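The "nearest neighbor of each cluster center" undersampling used inside CCBoost can be illustrated in the same spirit. Again this is a sketch under assumptions: the function name `cc_nearest_undersample` is invented for the example, and the thesis's actual integration with AdaBoost.M2 weight updates is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def cc_nearest_undersample(X_maj, n_keep, random_state=0):
    """Cluster-centroids-style undersampling of the majority class: cluster
    into n_keep groups, then keep, for each cluster center, the single
    nearest *real* sample rather than the synthetic centroid itself."""
    km = KMeans(n_clusters=n_keep, init="k-means++", n_init=10,
                random_state=random_state).fit(X_maj)
    idx = [int(np.argmin(np.linalg.norm(X_maj - c, axis=1)))
           for c in km.cluster_centers_]
    # Duplicates are possible if two centers share a nearest sample,
    # so the kept set may be slightly smaller than n_keep.
    return X_maj[np.unique(idx)]
```

Because real samples (not synthetic centroids) are retained, samples whose boosting weights were just increased can still be drawn into the next training subset, which is the property the abstract highlights for CCBoost.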
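The three evaluation criteria used throughout (F-measure, G-mean, AUC) are standard for imbalanced classification; a compact way to compute them, assuming the minority class is labelled 1, is:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score, roc_auc_score

def gmean(y_true, y_pred):
    """G-mean: geometric mean of sensitivity (minority recall) and
    specificity (majority recall)."""
    sens = recall_score(y_true, y_pred, pos_label=1)
    spec = recall_score(y_true, y_pred, pos_label=0)
    return float(np.sqrt(sens * spec))

def evaluate(y_true, y_pred, y_score):
    """Return the three criteria used in the experiments."""
    return {
        "f_measure": f1_score(y_true, y_pred),
        "g_mean": gmean(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
```

Unlike plain accuracy, all three criteria penalise a classifier that ignores the minority class, which is why they are the appropriate yardsticks for the experiments on the KEEL datasets.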
Keywords/Search Tags:imbalance classification, data preprocessing, information granulation, ensemble learning, undersampling