
The Research Of Imbalanced Data Based On Oversampling Technique

Posted on: 2019-10-07  Degree: Master  Type: Thesis
Country: China  Candidate: H Wang  Full Text: PDF
GTID: 2428330545985537  Subject: Computer application technology
Abstract/Summary:
In the field of data mining, a classification technique trains a classification function, or constructs a classification model, from the data in a data set in order to predict the class labels of unknown instances. In the study of imbalanced data classification, the minority class contains few samples and is therefore difficult to classify correctly, so improving the classification accuracy of the minority class is especially important. Current techniques for imbalanced data classification fall into two categories: the data level and the algorithm level. The former preprocesses the original training set before classification, chiefly through over-sampling and under-sampling; the latter proposes new algorithms tailored to the characteristics of imbalanced data, or adapts existing algorithms to tolerate the imbalance.

To improve the classification accuracy of minority-class samples in imbalanced data sets, this thesis presents three studies on data-level over-sampling techniques.

First, clustering is combined with over-sampling in a proposed algorithm, ClusteredSMOTE_Boost. The algorithm uses clustering to divide the minority-class samples into boundary and non-boundary samples, and then partitions all minority-class samples into several clusters. When new samples are synthesized from minority-class boundary samples, they are placed closer to the interior of the minority class; when new samples are synthesized from non-boundary samples, they are placed closer to the centre of the cluster containing those samples. Experimental results show that the algorithm effectively improves the classification accuracy of the minority class.

Second, to keep the decision boundary of the original training set from becoming more complicated, we propose GR_InsideOS, an over-sampling algorithm based on the inside (interior) samples of the minority class. The algorithm allows only interior minority-class samples to participate in synthesizing new samples, so that the new samples lie inside the minority-class region and the classification boundary is not complicated. Building on this, the clustering-based CGR_InsideOS algorithm is proposed, which uses clustering to place the new samples close to the cluster centres within the minority class, again ensuring that the decision boundary of the original training set stays simple. Experimental results show that both algorithms effectively improve classification performance on the minority class while preserving overall accuracy.

Third, inside-sample over-sampling is combined with multiple (ensemble) learning in two proposed algorithms, IRML and IKCML, both based on over-sampling the interior samples of the minority class. The two algorithms select samples from the original training set to form K subsets, then use GR_InsideOS to synthesize new samples, producing K new subsets and training K classifiers. IRML selects samples from the original training set at random, while IKCML selects them with a K-fold cross partitioning, which guarantees that each sample is learned the same number of times. The experimental results show that the combination of GR_InsideOS with the multiple learning algorithm is necessary...
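All of the algorithms above build on SMOTE-style interpolation: a synthetic minority sample is placed on the line segment between an existing minority sample and one of its nearest minority-class neighbours. As a point of reference, the following is a minimal sketch of plain SMOTE only; it is not the thesis's ClusteredSMOTE_Boost or GR_InsideOS, whose boundary/interior selection rules are not fully specified in the abstract, and the function name and parameters are illustrative.

```python
import random

def smote_sample(minority, k=5, n_new=100, seed=0):
    """Synthesize n_new minority-class samples (tuples of floats) by
    interpolating between a random minority sample and one of its
    k nearest minority-class neighbours, in the style of SMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class
        # (squared Euclidean distance; excludes x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two minority samples, it always lies inside the convex hull of the minority class; the thesis's interior-only variants tighten this further by restricting which samples may serve as interpolation endpoints.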
Keywords/Search Tags:Data mining, Classification of imbalanced data, Over-sampling, Multiple ensemble learning, Instance weights, Growth ratio