
The Research Of Imbalanced Data Based On Oversampling Technique

Posted on: 2019-10-07  Degree: Master  Type: Thesis
Country: China  Candidate: H Wang  Full Text: PDF
GTID: 2428330545985537  Subject: Computer application technology
Abstract/Summary:
In the field of data mining, a classification technique trains a classification function, or constructs a classification model, from the data in a data set in order to predict the class labels of unknown instances. In the study of imbalanced data classification, the minority class contains few samples and is therefore difficult to classify correctly, so improving the classification accuracy of the minority class is especially important. Current techniques for imbalanced data classification fall into two categories: the data level and the algorithm level. The former preprocesses the original training set before classification, chiefly through over-sampling and under-sampling; the latter proposes new algorithms tailored to the characteristics of imbalanced data, or adapts existing algorithms to tolerate the imbalance.

To improve the classification accuracy of minority-class samples in imbalanced data sets, this thesis presents three studies on data-level over-sampling techniques.

First, clustering is combined with over-sampling in a proposed algorithm, ClusteredSMOTE_Boost. The algorithm uses clustering to divide the minority-class samples into boundary and non-boundary samples, and then partitions all minority-class samples into several clusters. When new samples are synthesized from minority-class boundary samples, they are placed closer to the interior of the minority class; when new samples are synthesized from non-boundary samples, they are placed closer to the centre of the cluster containing those samples. Experimental results show that the algorithm effectively improves the classification accuracy of the minority class.

Second, to keep the decision boundary of the original training set from becoming more complicated, we propose GR_InsideOS, an over-sampling algorithm based on the inside (interior) samples of the minority class. The algorithm allows only interior minority-class samples to participate in synthesizing new samples, so that the new samples lie inside the minority-class region and the classification boundary is not complicated. Building on this, the clustering-based CGR_InsideOS algorithm is proposed, which uses clustering to place the new samples close to the cluster centres within the minority class, again ensuring that the decision boundary of the original training set stays simple. Experimental results show that both algorithms effectively improve classification performance on the minority class while preserving overall accuracy.

Third, inside-sample over-sampling is combined with multiple (ensemble) learning in two proposed algorithms, IRML and IKCML, both based on over-sampling the interior samples of the minority class. The two algorithms select samples from the original training set to form K subsets, then use GR_InsideOS to synthesize new samples, producing K new subsets and training K classifiers. IRML selects samples from the original training set at random, while IKCML selects them with a K-fold cross partitioning, which guarantees that each sample is learned the same number of times. The experimental results show that the combination of GR_InsideOS with the multiple learning algorithm is necessary...
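All of the algorithms above build on SMOTE-style interpolation: a synthetic minority sample is placed on the line segment between an existing minority sample and one of its nearest minority-class neighbours. As a point of reference, the following is a minimal sketch of plain SMOTE only; it is not the thesis's ClusteredSMOTE_Boost or GR_InsideOS, whose boundary/interior selection rules are not fully specified in the abstract, and the function name and parameters are illustrative.

```python
import random

def smote_sample(minority, k=5, n_new=100, seed=0):
    """Synthesize n_new minority-class samples (tuples of floats) by
    interpolating between a random minority sample and one of its
    k nearest minority-class neighbours, in the style of SMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class
        # (squared Euclidean distance; excludes x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two minority samples, it always lies inside the convex hull of the minority class; the thesis's interior-only variants tighten this further by restricting which samples may serve as interpolation endpoints.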
Keywords/Search Tags:Data mining, Classification of imbalanced data, Over-sampling, Multiple ensemble learning, Instance weights, Growth ratio