
Analysis Of Imbalance Classification With Generative Data Augmentation

Posted on: 2022-03-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Y Wang    Full Text: PDF
GTID: 1488306560493254    Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of database technology and computer hardware, huge amounts of data are being accumulated. Big data emphasizes scale, but the valuable information of interest usually occupies only a small proportion. In recent years, cloud computing, big data, brain science, and cognitive science have driven progress in artificial intelligence. As this research has moved into practical applications, the imbalanced classification problem has emerged, and improving the accuracy of models on imbalanced data has become an urgent task. Statistical learning theory reveals that traditional modeling methods struggle to obtain satisfactory classification performance in this setting; the relationship between the number of training instances and model complexity is expounded through Vapnik-Chervonenkis dimension theory, which established a consolidated theoretical basis for the inherent laws of learning with limited instances and gave rise to a series of machine learning algorithms. However, existing learning methods are adversely affected when the minority classes contain only a few instances in practice.

Researchers have proposed various techniques to improve classification performance on imbalanced data, among which data augmentation is popular and effective. Traditional augmentation strategies generate new instances by linear interpolation between pairs of instances; other data-fitting techniques use the global characteristics of the minority classes but overlook their local characteristics and the important information carried by the majority classes. This thesis therefore proposes three data augmentation methods based on generative models. The first is the conventional generative method LAMO (paper title: Local Distribution-based Adaptive Minority Oversampling for Imbalanced Data Classification), which uses the local class distribution to find sampling seeds and design a data generator. To process image data with large dimensions, we propose deep data augmentation methods built on deep generative models. Unlike most augmentation techniques, the proposed DGC (paper title: Deep Generative Model for Robust Imbalance Classification) and DGCMM (paper title: Deep Generative Mixture Model for Robust Imbalance Classification) models integrate data augmentation and classifier construction into a unified end-to-end framework. By generating new informative instances from the posterior distribution and through transfer learning strategies, they further improve classification performance.

The proposed LAMO first selects informative boundary instances from the minority classes as sampling seeds with the aid of the local class distribution. Though minority classes contain few instances, not all of them carry important information for generating new data points, so selecting informative sampling seeds helps design an effective data generator. To ensure that synthetic instances follow the same distribution as the observed minority data, LAMO uses a Gaussian Mixture Model (GMM) to approximate the probability density of the sampling seeds. According to the local distribution of the seeds, LAMO adaptively assigns proper weights (mixture coefficients) and variances (bandwidths) to the GMM components.

DGC is proposed to obtain informative latent representations for robust prediction via both data perturbation and model perturbation. Each latent variable is represented by a probability distribution over possible values rather than a single fixed value, which enforces model uncertainty and leads to stable prediction. Based on the distribution learnt via inference learning, additional latent variables are sampled for the minority classes to compensate for the imbalance in the latent space, which realizes the data perturbation. The proposed DGC model can therefore be viewed as a joint generative model that approximates the joint distribution of input features, labels, and latent variables, where the latent variables characterize the essential structure hidden in the original data and serve as the direct cause of the labels. The data generation process is formulated by minimizing the worst-case expectation of the optimal transport cost between the real and generated data distributions; it is implemented via a data (both features and labels) generation process and a constraint between the prior and posterior distributions on the latent code.

During training, the latent code is assumed to be sampled from a standard Gaussian prior for convenient computation. However, for the complex data encountered in real life, different classes may have different characteristics, i.e., a single mean and variance cannot capture the structure of the latent codes. We therefore propose DGCMM, which places a GMM assumption on the latent variables: an additional variable, prior to the latent codes, restricts the codes to lie on different components of the Gaussian Mixture Model. DGCMM provides two data augmentation strategies in the latent space. The first resembles that of DGC, generating new instances from the posterior distribution. The second is based on transfer learning: the statistical characteristics of the latent codes of the majority classes are transferred to the minority classes, which helps generate diverse and abundant minority-class instances.
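To make the LAMO idea concrete, the following Python sketch selects boundary minority instances as sampling seeds (via the local class distribution of each minority point's neighborhood) and fits a GMM on the seeds to draw synthetic minority instances. This is an illustrative approximation using scikit-learn, not the thesis's implementation; the function name `lamo_style_oversample` and all parameter defaults are our own assumptions, and the adaptive per-component weight/bandwidth assignment of the full method is simplified away here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

def lamo_style_oversample(X_min, X_maj, n_new, k=5, n_components=3, seed=0):
    """Sketch of LAMO-style oversampling: pick boundary minority
    instances as sampling seeds, fit a GMM on the seeds, and draw
    new minority instances from the fitted density."""
    X_all = np.vstack([X_min, X_maj])
    y_all = np.hstack([np.ones(len(X_min)), np.zeros(len(X_maj))])  # 1 = minority
    # Local class distribution: fraction of majority neighbours per minority point.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn.kneighbors(X_min)
    maj_frac = 1.0 - y_all[idx[:, 1:]].mean(axis=1)  # skip the point itself
    # Boundary seeds: minority points with at least one majority neighbour.
    seeds = X_min[maj_frac > 0]
    if len(seeds) < n_components:  # fall back to all minority points
        seeds = X_min
    gmm = GaussianMixture(n_components=min(n_components, len(seeds)),
                          random_state=seed).fit(seeds)
    X_new, _ = gmm.sample(n_new)
    return X_new
```

In this simplified form the mixture coefficients and covariances are estimated by EM on the seeds, whereas LAMO assigns them adaptively from the seeds' local distributions.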
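The latent-space data perturbation used by DGC can be sketched as follows: each instance's latent code is a diagonal-Gaussian posterior rather than a point, and extra codes are drawn for minority classes until the latent space is balanced. This is a minimal numpy sketch under our own assumptions (function names `sample_latents` and `rebalance_latents` are hypothetical); the real model learns the posteriors with an inference network and trains the classifier end-to-end.

```python
import numpy as np

def sample_latents(mu, log_var, n_draws, rng):
    """Reparameterised draw from a diagonal-Gaussian posterior
    q(z|x) = N(mu, diag(exp(log_var))): z = mu + sigma * eps."""
    eps = rng.standard_normal((n_draws, mu.shape[-1]))
    return mu + np.exp(0.5 * log_var) * eps

def rebalance_latents(mus, log_vars, labels, rng=None):
    """Draw extra latent codes for minority classes until every class
    has as many codes as the largest class (DGC-style data perturbation
    in the latent space -- an illustrative sketch, not the thesis code)."""
    rng = rng or np.random.default_rng(0)
    counts = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    target = max(counts.values())
    extra_z, extra_y = [], []
    for c, n in counts.items():
        deficit = target - n
        if deficit == 0:
            continue
        # Resample the posteriors of existing class-c instances.
        pick = rng.choice(np.where(labels == c)[0], size=deficit)
        for i in pick:
            extra_z.append(sample_latents(mus[i], log_vars[i], 1, rng)[0])
            extra_y.append(c)
    return np.array(extra_z), np.array(extra_y)
```

Because each code is sampled rather than fixed, repeated draws from the same posterior yield distinct but plausible minority latents, which is the source of the robustness the abstract describes.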
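DGCMM's second augmentation strategy, transferring majority-class latent statistics to the minority class, can be sketched in a few lines: recentre the majority codes' deviations around the minority mean. This is only one plausible reading of "transferring statistical characteristics"; the function name `transfer_augment` and the mean/deviation decomposition are our own assumptions, not the thesis's formulation.

```python
import numpy as np

def transfer_augment(z, labels, majority, minority, n_new, rng=None):
    """DGCMM-style transfer sketch: reuse the spread of the majority
    class's latent codes around the minority class's mean to generate
    diverse new minority latents. Illustrative only."""
    rng = rng or np.random.default_rng(0)
    z_maj = z[labels == majority]
    z_min = z[labels == minority]
    # Deviations of majority codes from their own mean: the transferred
    # "statistical characteristics" (spread/shape of the class).
    maj_centered = z_maj - z_maj.mean(axis=0)
    picks = rng.choice(len(maj_centered), size=n_new)
    # Minority mean + majority-class deviations = new minority codes.
    return z_min.mean(axis=0) + maj_centered[picks]
```

The appeal of this strategy is that the minority class contributes only its location in latent space, while the richer, better-estimated shape information comes from the abundant majority class, yielding the "diverse and abundant" minority instances the abstract mentions.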
Keywords/Search Tags:Imbalanced Data, Long-tailed Distribution, Data Augmentation, Generative Model, Deep Generative Model, Transfer Learning