
Entropy Difference And Kernel-based Oversampling Technique Research

Posted on: 2021-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: X Wu  Full Text: PDF
GTID: 2518306047488234  Subject: Applied Mathematics
Abstract/Summary:
Driven by the exponential growth of data and by high technologies such as information technology and computer networks, human society has entered a new era of data. Mining valuable information from large amounts of data and classifying it has therefore become especially important. Although data mining technology has gradually matured and is applied to a wide range of practical problems, the field still faces many challenges, such as the classification of imbalanced datasets. Oversampling is often used as a preprocessing method for imbalanced datasets. Specifically, synthetic oversampling techniques balance the number of training instances between the majority class and the minority class by generating additional artificial minority-class instances. However, current oversampling techniques consider only the imbalance in quantity and pay no attention to whether the distribution itself is balanced.

This paper proposes an entropy-difference and kernel-based SMOTE technique (EDKS), which measures the imbalance of a dataset at the distribution level through entropy difference and overcomes the limitation of SMOTE on nonlinear problems by oversampling in the feature space of a support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in the kernel space, determines the majority and minority classes, and locates the sparse regions of the minority class. Finally, the proposed method balances the data distribution by synthesizing new instances and evaluating their retention capability. The algorithm can effectively distinguish datasets that have the same imbalance ratio but different distributions. To verify its effectiveness, EDKS was compared against seven classical oversampling algorithms on 19 publicly available imbalanced datasets. Experimental results show that the proposed method performs significantly better than the other algorithms on multiple benchmark imbalanced datasets.

In addition, this paper introduces the concept of the dangerous set and three strategies for using it, namely the entropy-based dangerous-set oversampling algorithm, the entropy-based safe-set oversampling algorithm, and the entropy-based adaptive oversampling algorithm, all built on the entropy of local density information. Experimental results show that these algorithms can effectively improve the performance of classic oversampling algorithms. This work offers useful experience for follow-up studies on applying entropy-based information theory to the processing of imbalanced data.
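The sketch below is only an illustration of the general idea described above (kernel feature mapping, an entropy-based measure of local sparsity, and SMOTE-style interpolation), not the thesis's actual EDKS implementation. The RBF kernel approximated with scikit-learn's Nystroem map, the k-nearest-neighbour distance entropy used as a sparsity proxy, and the function names local_entropy and kernel_entropy_smote are all assumptions introduced here for demonstration.

```python
# Illustrative sketch: entropy-guided, SMOTE-style oversampling in an
# approximate kernel feature space. NOT the thesis's EDKS algorithm;
# kernel map, entropy proxy, and sampling rule are assumed for illustration.
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.neighbors import NearestNeighbors

def local_entropy(X, k=5):
    """Shannon entropy of normalized k-NN distances, one value per sample.
    Used here as an assumed proxy for local sparsity (higher = sparser)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    d = dist[:, 1:]                                   # drop self-distance
    p = d / np.clip(d.sum(axis=1, keepdims=True), 1e-12, None)
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=1)

def kernel_entropy_smote(X, y, minority_label, n_new, gamma=1.0, k=5, seed=0):
    """SMOTE-style interpolation in an approximate RBF feature space,
    drawing seed points preferentially from high-entropy (sparse) minority regions."""
    rng = np.random.default_rng(seed)
    phi = Nystroem(gamma=gamma, n_components=min(100, len(X)), random_state=seed)
    Z = phi.fit_transform(X)                          # explicit approximate feature map
    Zmin = Z[np.asarray(y) == minority_label]
    ent = local_entropy(Zmin, k=k)
    weights = ent / ent.sum()                         # sparser regions get more synthetics
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Zmin)
    _, idx = nn.kneighbors(Zmin)
    new = []
    for _ in range(n_new):
        i = rng.choice(len(Zmin), p=weights)          # entropy-weighted seed
        j = rng.choice(idx[i, 1:])                    # random minority neighbour
        lam = rng.random()
        new.append(Zmin[i] + lam * (Zmin[j] - Zmin[i]))
    return np.vstack(new)                             # synthetic points in feature space
```

Because the synthetic points live in the (approximate) feature space and generally have no exact pre-image in the input space, one plausible way to consume them is to train a linear classifier, for example a linear SVM, directly on the explicit feature representation Z augmented with the returned synthetics.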
Keywords/Search Tags:Imbalanced dataset, Oversampling, Kernel space, Entropy difference, SVM