
Research On Adaptive Imbalanced Data Classification

Posted on: 2021-04-13 | Degree: Master | Type: Thesis
Country: China | Candidate: J Xu | Full Text: PDF
GTID: 2428330614971800 | Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data learning is a common and long-standing problem in many application areas, such as telecommunications management, credit card fraud detection, rare disease diagnosis, text mining, speech recognition, image processing, object detection, and sentiment classification. In binary imbalanced classification, the class containing more instances is the majority class and the class containing fewer instances is the minority class. When the imbalance ratio (the number of majority instances relative to the number of minority instances) is large, it is difficult to capture the pattern of the minority class. In this case, conventional classifiers usually favor the majority class and fail to classify the minority class correctly. Imbalanced data classification is also often affected by within-class imbalance, between-class imbalance, noisy data, and redundant data. It therefore faces the following main challenges: how to select valuable samples, how to remove noisy and redundant samples, and how to generate minority samples or delete useless majority samples more reasonably. Time and space complexity pose a further challenge. This thesis proposes several methods for handling imbalanced data in response to these problems. The main work and contributions are as follows.

First, we propose a novel active learning framework (COAL) to select and generate informative instances. In imbalanced data, the majority class contains a large amount of redundant instance information, and the minority class instances are difficult to distinguish. To preserve the structure and enhance the diversity of the original data, the framework uses a clustering method to divide the majority class instances into several subsets in the initial stage and uses an uncertainty strategy to select informative instances. Meanwhile, to avoid imbalance during the active learning process, an oversampling method is used to balance the number of instances between classes. A series of experiments on real-world data compares COAL with other state-of-the-art methods. The empirical comparisons demonstrate the superiority of the proposed COAL framework, and the results are satisfactory.

Second, we propose a new framework (SSIC) that selects informative instances in two main phases. The framework fully considers the statistical properties of the dataset, adaptively selects valuable instances from the different classes, and incorporates cost-sensitive learning to construct an imbalance classifier. In the first phase, SSIC constructs several balanced data subsets by combining part of the majority-class instances with all of the minority-class instances. On each subset, SSIC exploits the characteristics of the data to extract discriminative high-level features and adaptively select the important samples, so that redundant and noisy data can be removed. In the second phase, SSIC introduces a cost-sensitive support vector machine that automatically assigns a proper weight to each instance, so that the minority class is treated equally with the majority class. A series of experiments on real-world data compares SSIC with other state-of-the-art methods. The empirical comparisons demonstrate the superiority of the proposed SSIC framework, and the results are satisfactory.
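The selection and balancing steps described for COAL can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `select_uncertain_per_cluster` and `random_oversample` are hypothetical, the cluster assignments and classifier probabilities are assumed to be given, uncertainty is assumed to mean "estimated probability closest to 0.5", and plain random duplication stands in for whichever oversampling method the thesis actually uses.

```python
import random
from collections import Counter, defaultdict


def select_uncertain_per_cluster(probs, cluster_ids, per_cluster=1):
    """Return indices of the most uncertain instances in each cluster.

    probs[i] is an (assumed) classifier estimate of P(minority) for
    instance i; an estimate near 0.5 is treated as most uncertain.
    cluster_ids[i] is the (assumed, precomputed) cluster of instance i.
    """
    by_cluster = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        by_cluster[c].append(i)

    chosen = []
    for c in sorted(by_cluster):
        # Rank the cluster's members by distance from the 0.5 boundary.
        ranked = sorted(by_cluster[c], key=lambda i: abs(probs[i] - 0.5))
        chosen.extend(ranked[:per_cluster])
    return chosen


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority instances until both classes
    have the same number of instances (simple random oversampling)."""
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority_label]
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]

    rng = random.Random(seed)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]
```

Selecting per cluster rather than over the whole pool is one way to keep the chosen instances spread across the structure of the majority class, which matches the stated goal of preserving structure and diversity.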
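The subset construction in the first phase of SSIC, together with the idea of cost-sensitive weighting, can be sketched as follows. This is an illustration under assumptions, not the author's method: the helper names are hypothetical, the majority class is simply shuffled and cut into chunks of minority size, and the weights use the common inverse-frequency heuristic per class, whereas the thesis describes weights assigned automatically per instance.

```python
import random
from collections import Counter


def imbalance_ratio(y):
    """Majority count divided by minority count for binary labels."""
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


def balanced_subsets(X, y, seed=0):
    """Pair each chunk of the shuffled majority class with ALL minority
    instances. Chunks have minority size; the last one may be smaller."""
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]

    rng = random.Random(seed)
    rng.shuffle(majority)

    size = len(minority)
    subsets = []
    for start in range(0, len(majority), size):
        idx = majority[start:start + size] + minority
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


def inverse_frequency_weights(y):
    """Class weight n / (k * n_c): rarer classes get larger weights.
    A stand-in heuristic, not the weighting scheme of the thesis."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}
```

With 90 majority and 10 minority instances, for example, this yields nine subsets of 20 instances each, and the minority class receives nine times the weight of the majority class.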
Keywords/Search Tags: Imbalanced Data, Oversampling Method, Informative Instance Selection, Active Learning, Cost-sensitive Learning