Research On Imbalanced Dataset Classification Based On Oversampling Technique

Posted on: 2020-12-11 | Degree: Master | Type: Thesis
Country: China | Candidate: Y F Zhang | Full Text: PDF
GTID: 2428330578964120 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of computer technology, especially the progress of computer hardware, technologies for storing and processing massive data sets have been integrated into all walks of life. Data mining is a data processing technology commonly used in industry, which provides decision makers with more decision information through data processing and model construction. In the process of using data mining to process data and build models, the imbalanced classification problem is often encountered; that is, some classes in the classification problem have many more samples than others. However, traditional classification algorithms assume that the data distribution is roughly balanced, so it is difficult for them to achieve good results on imbalanced data sets. Aiming at the classification of imbalanced data, this thesis makes an in-depth study of improvements at the data level. The main work of this thesis is as follows:

Firstly, the classical oversampling algorithms are introduced and analyzed in detail. We introduce three classical oversampling algorithms, SMOTE, Borderline-SMOTE and ADASYN, and analyze their advantages and disadvantages according to the characteristics of each algorithm. This analysis is verified by experimental results on multiple data sets.

Secondly, in order to enhance the classification boundary and reduce the generation of noise samples, we propose LOTE, an oversampling algorithm that integrates the Lévy distribution. According to the location of each minority class sample, the Lévy distribution is used to set the density of the new samples. A sample at the boundary sits at the highest point of the Lévy distribution, so the algorithm maximizes the density of new samples synthesized at the boundary and thus enhances the classification boundary. A sample close to the majority class sits where the slope of the Lévy distribution is small, so the density of new samples there is slightly reduced compared with the boundary samples, which helps reduce the generation of noise samples. Because samples close to the minority class are relatively safe, they sit where the slope of the Lévy distribution is large, so the density of new samples there is greatly reduced compared with the boundary samples, thus reducing the generation of useless samples. Experiments show that the proposed algorithm can effectively improve the performance of the classifier.

Finally, sampling algorithms easily generate noise samples when the data set is not linearly separable. To solve this problem, we propose KLOTE, a sampling algorithm that combines the kernel-based oversampling algorithm with the LOTE algorithm. The kernel-based oversampling algorithm transforms the generation of new samples into the expansion of the Gram matrix of the data set, so that new samples can be synthesized in the feature space. Combining LOTE with the kernel method makes it possible to divide the minority class samples, in the feature space, into boundary samples, samples close to the minority class, and samples close to the majority class. The proposed algorithm can therefore set the density of new samples more accurately and give full play to the advantages of LOTE in enhancing the classification boundary and reducing noise generation.

In summary, for the classification of imbalanced data, we approach the problem from the perspective of oversampling and propose the LOTE and KLOTE algorithms. LOTE uses the Lévy distribution to construct the density of new samples in oversampling, which enhances the classification boundary and reduces noise generation compared with existing algorithms. KLOTE is an extension of LOTE to the feature space, which can effectively improve classifier performance on data sets that are not linearly separable in the original input space.
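
The role the abstract assigns to the Lévy distribution can be illustrated with a short sketch. The density function below is the standard Lévy probability density, whose peak lies at `mu + c / 3`. Everything else is an assumption made for illustration and is not taken from the thesis: the abstract does not say how a minority sample's location is mapped onto the curve, so the scalar `positions` and the `allocate_counts` helper are hypothetical.

```python
import math


def levy_pdf(x, mu=0.0, c=1.0):
    """Standard Levy density with location mu and scale c (zero for x <= mu)."""
    if x <= mu:
        return 0.0
    z = x - mu
    return math.sqrt(c / (2.0 * math.pi)) * math.exp(-c / (2.0 * z)) / z ** 1.5


def allocate_counts(positions, total, mu=0.0, c=1.0):
    """Hypothetical helper: share `total` synthetic samples among minority
    samples in proportion to the Levy density at each sample's position.

    `positions` is an assumed scalar score per minority sample:
      - near mu + c/3 (the peak): boundary sample
      - below the peak (steep side): safe sample close to the minority class
      - above the peak (flat tail): sample close to the majority class
    Counts are rounded independently, so they may not sum exactly to `total`.
    """
    weights = [levy_pdf(p, mu, c) for p in positions]
    weight_sum = sum(weights)
    if weight_sum == 0.0:
        return [0] * len(positions)
    return [round(total * w / weight_sum) for w in weights]


# Three illustrative samples: safe, boundary, close to the majority class.
counts = allocate_counts([0.1, 1.0 / 3.0, 1.0], total=100)
print(counts)
```

Under these assumptions the boundary sample receives the largest share, the sample in the flat tail receives a moderately smaller share, and the sample on the steep side receives the smallest, which mirrors the "slightly reduced" and "greatly reduced" densities described in the abstract.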
Keywords/Search Tags: Imbalanced dataset, Oversampling, Lévy distribution, Kernel method