Research On Imbalanced Dataset Classification Based On Oversampling Technique

Posted on:2020-12-11

Degree:Master

Type:Thesis

Country:China

Candidate:Y F Zhang

Full Text:PDF

GTID:2428330578964120

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of computer technology, and especially of computer hardware, technologies for storing and processing massive data sets have been integrated into all walks of life. Data mining is a data processing technology commonly used in industry; it provides decision makers with more decision information through data processing and model construction. When data mining is used to process data and build models, the imbalanced classification problem is often encountered: some classes contain far more samples than others. Traditional classification algorithms, however, assume that the class distribution is roughly balanced, so they struggle to achieve good results on imbalanced data sets. To address the classification of imbalanced data, this thesis studies improvements at the data level in depth. The main work is as follows.

First, the classical oversampling algorithms are introduced and analyzed in detail. We introduce three classical oversampling algorithms, SMOTE, Borderline-SMOTE, and ADASYN, and analyze their advantages and disadvantages according to the characteristics of each. The analysis is verified by experimental results on multiple data sets.

Second, to strengthen the classification boundary and reduce the generation of noise samples, we propose LOTE, an oversampling algorithm that integrates the Lévy distribution. The Lévy distribution sets the density of the new samples according to the location of each minority class sample. A sample at the boundary sits at the peak of the Lévy distribution, so the algorithm maximizes the density of new samples synthesized at the boundary and thus strengthens the classification boundary. A sample close to the majority class sits where the slope of the Lévy distribution is small, so the density of new samples there is only slightly lower than at the boundary, which helps reduce the generation of noise samples. Samples close to the minority class are relatively safe; they sit where the slope of the Lévy distribution is large, so the density of new samples there is much lower than at the boundary, which reduces the generation of useless samples. Experiments show that the proposed algorithm effectively improves classifier performance.

Finally, sampling algorithms easily generate noise samples when the data set is not linearly separable. To solve this problem, we propose a sampling algorithm that combines kernel-based oversampling with the LOTE algorithm. The kernel-based oversampling algorithm turns the generation of new samples into an expansion of the Gram matrix of the data set, so that new samples can be synthesized in the feature space. Combining LOTE with the kernel method makes it possible to divide the minority class samples, in feature space, into boundary samples, samples close to the minority class, and samples close to the majority class. The proposed algorithm can therefore set the density of new samples more accurately and take full advantage of LOTE's strengths in strengthening the classification boundary and reducing noise generation.

In summary, for the classification of imbalanced data, we approach the problem from the perspective of oversampling and propose the LOTE and KLOTE algorithms. LOTE uses the Lévy distribution to construct the density of new samples during oversampling, which strengthens the classification boundary and reduces noise generation compared with existing algorithms. KLOTE extends LOTE to the feature space and can effectively improve classifier performance on data sets that are not linearly separable in the original input space.
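The density idea behind LOTE can be illustrated with a small sketch. The abstract does not give the thesis's actual formulas, so everything below is an assumption made for illustration: the function names (`levy_pdf`, `majority_ratio`, `levy_weights`, `oversample`), the use of the share of majority class neighbours as a measure of a sample's location, and the linear mapping that places a sample with half majority neighbours at the mode `c / 3` of the Lévy density. New samples are synthesized with ordinary SMOTE-style interpolation between minority neighbours.

```python
import math
import random


def levy_pdf(x, c=1.0):
    """Density of the Lévy distribution with location 0 and scale c."""
    if x <= 0:
        return 0.0
    return math.sqrt(c / (2 * math.pi)) * math.exp(-c / (2 * x)) / x ** 1.5


def majority_ratio(point, minority, majority, k=5):
    """Share of majority class samples among the k nearest neighbours."""
    labelled = [(math.dist(point, q), 0) for q in minority if q is not point]
    labelled += [(math.dist(point, q), 1) for q in majority]
    labelled.sort(key=lambda pair: pair[0])
    nearest = labelled[:k]
    return sum(label for _, label in nearest) / len(nearest)


def levy_weights(minority, majority, k=5, c=1.0):
    """Sampling weight per minority sample (hypothetical mapping).

    Assumption: a ratio of 0.5 (boundary) maps to the mode c / 3, ratios
    near 0 (safe) fall on the steep left side of the density, and ratios
    near 1 (close to the majority class) fall on the flat right tail.
    """
    mode = c / 3.0
    raw = [
        levy_pdf(2 * mode * majority_ratio(p, minority, majority, k), c)
        for p in minority
    ]
    total = sum(raw)
    if total == 0:
        # No sample has majority neighbours: fall back to uniform weights.
        return [1.0 / len(minority)] * len(minority)
    return [w / total for w in raw]


def oversample(minority, majority, n_new, k=5, seed=0):
    """Draw base samples by Lévy weight, then interpolate as in SMOTE.

    Needs at least two minority samples.
    """
    rng = random.Random(seed)
    weights = levy_weights(minority, majority, k)
    synthetic = []
    for _ in range(n_new):
        base = rng.choices(minority, weights=weights, k=1)[0]
        neighbours = sorted(
            (q for q in minority if q is not base),
            key=lambda q: math.dist(base, q),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic
```

With this mapping, a boundary sample receives the largest weight, a sample surrounded by majority class neighbours receives a somewhat smaller one, and a safe sample deep inside the minority class receives a weight close to zero, which mirrors the qualitative behaviour described above without claiming to reproduce the thesis's method.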

Keywords/Search Tags:

Imbalanced dataset, Oversampling, Lévy distribution, Kernel method

Related items

1	Entropy Difference And Kernel-based Oversampling Technique Research
2	Research On Imbalanced Datasets Classification Based On Machine Learning And Oversampling Methods
3	Research On Classification Algorithm For Imbalanced Data
4	Research On Imbalanced Oversampling Method For Internet Video Traffic Identification
5	Application Research Of Used-car Recommendation Based On Classification Method On Imbalanced Data Sets
6	Research And Application Of Equalization Method For Imbalanced Dataset
7	User Complaint Prediction System Based On The KPI Dataset From IPTV Set-Top Box
8	Classification On Imbalanced Datasets
9	Research On Oversampling Method For Multi-class Imbalanced Learning
10	Research On Imbalanced Dataset Classification Algorithm Based On Sampling