
Entropy Difference And Kernel-based Oversampling Technique Research

Posted on: 2021-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: X Wu  Full Text: PDF
GTID: 2518306047488234  Subject: Applied Mathematics
Abstract/Summary:
Driven by the exponential growth of data and by high technologies such as information technology and computer networks, human society has entered a new era of data. Mining valuable information from large amounts of data and classifying it has therefore become especially important. Although data mining technology has gradually matured and is applied to a wide range of practical problems, the field still faces many challenges, such as the classification of imbalanced datasets. Oversampling is often used as a preprocessing method for imbalanced datasets. Specifically, synthetic oversampling techniques balance the number of training instances between the majority class and the minority class by generating additional artificial minority-class instances. However, current oversampling techniques consider only the imbalance in quantity and pay no attention to whether the distribution itself is balanced.

This paper proposes an entropy-difference and kernel-based SMOTE technique (EDKS), which measures the imbalance of a dataset at the distribution level through entropy difference and overcomes the limitation of SMOTE on nonlinear problems by oversampling in the feature space of a support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in the kernel space, determines the majority and minority classes, and locates the sparse regions of the minority class. Finally, the proposed method balances the data distribution by synthesizing new instances and evaluating their retention capability. The algorithm can effectively distinguish datasets that have the same imbalance ratio but different distributions. To verify its effectiveness, EDKS was compared against seven classical oversampling algorithms on 19 publicly available imbalanced datasets. Experimental results show that the proposed method performs significantly better than the other algorithms on multiple benchmark imbalanced datasets.

In addition, this paper introduces the concept of the dangerous set and three strategies for using it, namely the entropy-based dangerous-set oversampling algorithm, the entropy-based safe-set oversampling algorithm, and the entropy-based adaptive oversampling algorithm, all built on the entropy of local density information. Experimental results show that these algorithms can effectively improve the performance of classic oversampling algorithms. This work offers useful experience for follow-up studies on applying entropy-based information theory to the processing of imbalanced data.
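The sketch below is only an illustration of the general idea described above (kernel feature mapping, an entropy-based measure of local sparsity, and SMOTE-style interpolation), not the thesis's actual EDKS implementation. The RBF kernel approximated with scikit-learn's Nystroem map, the k-nearest-neighbour distance entropy used as a sparsity proxy, and the function names local_entropy and kernel_entropy_smote are all assumptions introduced here for demonstration.

```python
# Illustrative sketch: entropy-guided, SMOTE-style oversampling in an
# approximate kernel feature space. NOT the thesis's EDKS algorithm;
# kernel map, entropy proxy, and sampling rule are assumed for illustration.
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.neighbors import NearestNeighbors

def local_entropy(X, k=5):
    """Shannon entropy of normalized k-NN distances, one value per sample.
    Used here as an assumed proxy for local sparsity (higher = sparser)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    d = dist[:, 1:]                                   # drop self-distance
    p = d / np.clip(d.sum(axis=1, keepdims=True), 1e-12, None)
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=1)

def kernel_entropy_smote(X, y, minority_label, n_new, gamma=1.0, k=5, seed=0):
    """SMOTE-style interpolation in an approximate RBF feature space,
    drawing seed points preferentially from high-entropy (sparse) minority regions."""
    rng = np.random.default_rng(seed)
    phi = Nystroem(gamma=gamma, n_components=min(100, len(X)), random_state=seed)
    Z = phi.fit_transform(X)                          # explicit approximate feature map
    Zmin = Z[np.asarray(y) == minority_label]
    ent = local_entropy(Zmin, k=k)
    weights = ent / ent.sum()                         # sparser regions get more synthetics
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Zmin)
    _, idx = nn.kneighbors(Zmin)
    new = []
    for _ in range(n_new):
        i = rng.choice(len(Zmin), p=weights)          # entropy-weighted seed
        j = rng.choice(idx[i, 1:])                    # random minority neighbour
        lam = rng.random()
        new.append(Zmin[i] + lam * (Zmin[j] - Zmin[i]))
    return np.vstack(new)                             # synthetic points in feature space
```

Because the synthetic points live in the (approximate) feature space and generally have no exact pre-image in the input space, one plausible way to consume them is to train a linear classifier, for example a linear SVM, directly on the explicit feature representation Z augmented with the returned synthetics.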
Keywords/Search Tags:Imbalanced dataset, Oversampling, Kernel space, Entropy difference, SVM