Research On Key Techniques For Class Imbalanced Data Classification

Posted on: 2022-04-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Zhang
Full Text: PDF
GTID: 1488306497989879
Subject: Computer application technology
Abstract/Summary:
Since the start of the 21st century, the amount of data has grown explosively, with new data generated in many different fields every day. Identifying and mining useful information from these massive data is therefore very important. As one of the key research branches of data mining, classification has significant scientific research value. However, in many practical applications the data exhibit class imbalance, which makes classification difficult. To date, many methods have been proposed for imbalanced data classification, and some of them have achieved good results. However, several problems remain to be solved:

(1) Over-sampling and under-sampling are the two common data balancing techniques. Over-sampling must synthesize a large number of new samples, which easily introduces noise, while under-sampling must remove a large number of original samples, which discards much useful information, so neither technique alone solves the class imbalance problem well. Mixed sampling alleviates the drawbacks of using either technique alone by applying over-sampling and under-sampling at the same time. However, the performance of existing mixed sampling methods depends on that of the underlying over-sampling or under-sampling technique, so mixed sampling cannot eliminate these drawbacks fundamentally. It is therefore worthwhile to design a data balancing technique that neither adds too much noisy data nor loses much useful information.

(2) In practical classification we face a more difficult scenario: high class imbalance. Generally speaking, data whose imbalance ratio is greater than 10 are called highly class imbalanced. At present, few classification methods focus on highly class imbalanced data, and the performance of these methods degrades heavily as the imbalance ratio increases. In addition, the feature space is distinctly linearly inseparable, which further increases the difficulty of classification. How to design a data balancing strategy for high class imbalance together with a more effective feature extraction method is therefore a problem worth studying.

(3) Compared with two-class classification, multi-class imbalanced data classification is more common and more challenging. The main challenges are: (i) the number of samples differs from class to class, sometimes greatly, so it is unclear how to set a unified sampling scale that gives all classes the same number of samples and thereby reduces the negative impact of class imbalance; (ii) how to design a more effective metric learning network that extracts more discriminative features after data balancing. Most existing methods consider only one of these challenges or simply combine solutions to both, so their final classification performance is not ideal. How to design an effective classification method for multi-class imbalanced data is therefore of practical value and significance.

In this thesis, we study the above three unsolved problems in imbalanced data classification and propose a corresponding approach for each. The main contributions are as follows:

(1) To address the problem that existing data re-sampling techniques must remove or synthesize many samples, we propose a new data re-sampling method based on balanced subset partition. This method divides the majority class into multiple subclasses and then combines each subclass with the minority-class samples to form multiple balanced subsets. We give two balanced subset construction strategies: one based on random partition and one based on hierarchical division. Experimental results on six class imbalanced datasets demonstrate the effectiveness of the proposed re-sampling method.

(2) To solve the problem of highly class imbalanced data classification, we propose a deep multi-set feature learning approach based on generative adversarial network (GAN) sampling. Our approach first designs a GAN-based balanced subset construction strategy to ensure that each constructed balanced subset has a distribution similar to that of the original data, and then introduces deep metric learning into the multi-set feature learning framework to address the non-linearity issue in the existing multi-set framework. In addition, to help the model extract more favourable features and thus improve classification performance, we design a new discriminant term and incorporate cost-sensitive learning into it. Experimental results on eight highly class imbalanced datasets from four different fields demonstrate the effectiveness of the proposed approach.

(3) To address the problem of multi-class imbalanced data classification, we propose a deep multi-set discrimination metric learning approach based on optimal balanced sampling. We first propose a novel optimal balanced sampling strategy designed directly for multi-class imbalanced data, which not only gives every class the same number of samples after sampling but also keeps the total number of sampled instances to a minimum. To make full use of the advantages of multiple balanced subsets, we introduce consistency metric learning and difference metric learning into deep multi-set feature learning for the first time. In addition, because the initial features used in multi-set feature learning are homogeneous, we tailor a new consistency metric learning objective function that considers not only intra-set consistency but also inter-set consistency. Experimental results on six multi-class imbalanced datasets from three different fields verify the effectiveness of the proposed approach.
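The random-partition strategy of contribution (1) can be illustrated with a short Python sketch. This is a minimal illustration for the two-class case, not the thesis's implementation: the function names `imbalance_ratio` and `random_balanced_subsets`, the definition of the imbalance ratio as majority count divided by minority count, and the choice to use the rounded imbalance ratio as the number of subsets are all assumptions, since the abstract does not specify them.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_balanced_subsets(labels, seed=0):
    """Randomly split the majority class into chunks and pair each chunk
    with every minority sample. Returns one list of sample indices per subset."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch handles exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]

    # Assumption: the number of subsets equals the rounded imbalance ratio,
    # so each majority chunk is roughly the size of the minority class.
    k = max(1, round(n_major / n_minor))

    rng = random.Random(seed)
    rng.shuffle(major_idx)

    # Chunk i takes every k-th shuffled majority index, so chunk sizes
    # differ by at most one and no majority sample is discarded.
    return [sorted(major_idx[i::k] + minor_idx) for i in range(k)]


# Usage: 100 majority samples (label 0) and 10 minority samples (label 1).
labels = [0] * 100 + [1] * 10
subsets = random_balanced_subsets(labels)
```

In this sketch every majority sample appears in exactly one subset and every minority sample appears in all of them, so nothing is synthesized or removed. A separate classifier or feature extractor could then be trained on each balanced subset.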
Keywords/Search Tags: Class imbalance, Data re-sampling, Balanced subset, Deep metric learning, Multi-set feature learning