Extremely Imbalanced And Overlapped Data Classification

Posted on: 2022-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: S T Gao
GTID: 2518306479493884
Subject: Software Engineering
Abstract/Summary:
With the growing capacity of data storage and advances in data generation and collection technologies, large volumes of data have become available in real-world applications. Many of these datasets have an imbalanced class distribution. Canonical classifiers often fail on such data because they assume that every class has the same number of instances and that every misclassification has the same cost. How to mine information from imbalanced data and build models on it is attracting rising attention from researchers, and a great number of approaches have been proposed. However, most of these models perform poorly when a dataset combines high class imbalance, class overlap and noisy data.

In this thesis, we examine, from the perspective of data scenarios, the problems of majority-class preference, information loss and overfitting that arise in imbalanced data classification. We explore the application of self-paced learning to imbalanced data classification, and we study the importance of instances that lie in the class-overlapping region. We propose a novel framework called DAPS (DynAmic self-Paced enSemble) that contains two important steps: (1) reasonable and effective sampling, to maximize the utilization of informative instances and avoid serious information loss; and (2) assigning proper instance weights, to address the issues of noisy data and model overfitting.

The main contributions of this thesis are summarized as follows.

1. A dynamic self-paced sampling mechanism for training sample selection. It selects the most reasonable and effective instances during training, maximizes the utilization of instances, and avoids overfitting and information loss. It uses a dedicated measure to compute classification difficulty under different classifiers and different data distributions.
2. An instance weighting mechanism for class overlap and noise. It identifies the instances in the class-overlapping region, and it assigns different weights to different instances so that important instances receive more attention and noisy instances are learned less.
3. A novel framework, DAPS (DynAmic self-Paced enSemble), for classifying highly imbalanced, class-overlapped and low-quality data. Most existing canonical classifiers (e.g. Decision Tree, Random Forest, GBDT) can be integrated into DAPS.

Comprehensive experiments on synthetic data and three real-world datasets show that the DAPS model obtains considerable improvement in accuracy compared with a broad range of models.
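The two steps named in the abstract (difficulty-aware sampling and instance weighting) can be illustrated with a minimal Python sketch. The abstract does not give the actual difficulty measure, sampling schedule or weighting formula of DAPS, so everything below is an assumption made for illustration: `hardness` uses the absolute error of a predicted probability, `self_paced_sample` bins majority-class instances by hardness and samples each bin in inverse proportion to its mean hardness, and `instance_weight` is a hypothetical rule based on the fraction of an instance's nearest neighbours that belong to the other class. All function and parameter names are invented here.

```python
import random


def hardness(prob_pos, label):
    """Assumed difficulty measure: absolute error of the predicted
    positive-class probability against the 0/1 label."""
    return abs(prob_pos - label)


def self_paced_sample(hard, n_select, n_bins=5, alpha=0.1, rng=None):
    """Pick roughly n_select majority-class indices.

    Instances are grouped into n_bins equal-width hardness bins. Each bin
    gets a weight of 1 / (mean hardness + alpha); a larger alpha flattens
    the weights, so sparsely populated hard bins keep a larger share.
    """
    if not hard:
        return []
    rng = rng or random.Random(0)
    bins = [[] for _ in range(n_bins)]
    for i, h in enumerate(hard):
        bins[min(int(h * n_bins), n_bins - 1)].append(i)

    weights = []
    for b in bins:
        if b:
            mean_h = sum(hard[i] for i in b) / len(b)
            weights.append(1.0 / (mean_h + alpha))
        else:
            weights.append(0.0)  # empty bins contribute nothing
    total = sum(weights)

    chosen = []
    for b, w in zip(bins, weights):
        k = min(len(b), round(n_select * w / total))
        chosen.extend(rng.sample(b, k))
    return sorted(chosen)


def instance_weight(frac_opposite):
    """Hypothetical weighting rule.

    frac_opposite is the fraction of an instance's nearest neighbours that
    carry the other class label. Instances deep inside their own class keep
    weight 1, instances in the overlap region (fraction near 0.5) are
    boosted, and instances surrounded by the other class (likely noise)
    are driven toward 0.
    """
    f = frac_opposite
    return (1.0 - f) + 4.0 * f * (1.0 - f)


# Toy usage: 100 majority instances, most of them easy.
hard = [0.05] * 80 + [0.5] * 15 + [0.95] * 5
selected = self_paced_sample(hard, n_select=20)
```

In a full ensemble, each round would retrain a base classifier (for example a decision tree) on the minority class plus the selected majority instances, pass the instance weights to the learner, and recompute hardness from the current ensemble's predictions before the next round. That loop is omitted here because the abstract does not specify it.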
Keywords/Search Tags: classification, imbalanced data, overlapped data