Research On Sampling Method Optimization In Imbalanced Data

Posted on: 2022-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Zhu
Full Text: PDF
GTID: 2518306542963279
Subject: Computer technology
Abstract/Summary:
Imbalanced data poses challenges for many real-world applications. Specifically, traditional machine learning classification algorithms, such as decision trees, support vector machines, and neural networks, often perform unsatisfactorily on imbalanced data. Current solutions to imbalanced data classification mainly fall into data preprocessing and algorithm-level research. The Synthetic Minority Oversampling Technique (SMOTE) is one of the most representative oversampling methods, but it has shortcomings such as blind neighbor selection, class overlap, and marginalization of the data distribution. Meanwhile, recent studies show that samples from overlapping regions play an increasingly important role in improving classification performance on imbalanced data. Therefore, working at the data level, this thesis studies how to optimize sampling methods effectively in order to improve the learning ability of traditional machine learning classification algorithms. The research contents are summarized as follows:

(1) To address the defects of SMOTE, this thesis proposes Adaptive Neighborhood-sensitive imbalanced Oversampling (ANO). Its main mechanism is to synthesize samples under the constraint of the local distribution information of each minority sample. It first learns the distribution of minority samples with a detection process based on geometric representation. Then, considering the inherently complex characteristics of imbalanced data, Particle Swarm Optimization (PSO) is used to sense the local neighborhood information of each minority sample, and ANO uses this information to constrain the oversampling process. As a result, ANO yields an approximately optimal rebalanced dataset together with the corresponding neighborhood of each minority sample. Experimental results on a large number of datasets show that ANO significantly improves robustness and classification performance compared with numerous state-of-the-art methods.

(2) Motivated by recent studies of overlapping scenarios, this thesis proposes Evolutionary Hybrid Sampling in Overlapping Scenarios (EHSO) to deal with overlapping samples from different classes. The main purpose of EHSO is to make the decision boundary clearer by removing useless majority class samples. It first detects the overlapping areas through nearest-neighbor calculation, and an objective function is designed to identify the overlapping samples that adversely affect classification while reducing both the imbalance rate and the overlapping rate. EHSO then applies an Evolutionary Algorithm (EA) to obtain an optimal subset of majority samples. Finally, EHSO uses Random Oversampling (ROS) to rebalance the data distribution with minimal overfitting, which avoids introducing new, unexpected samples that differ from the original data. Numerical experiments on imbalanced datasets demonstrate its superiority over other well-known sampling methods.
Keywords/Search Tags: Imbalanced data, Classification, Sampling method, Optimization algorithm
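
As a rough illustration of the sampling building blocks named in the abstract, the Python sketch below shows plain SMOTE-style interpolation (the baseline that ANO is said to constrain), a nearest-neighbor check for majority samples lying in an overlapping region, and random oversampling. It is not the thesis implementation: the PSO and EA optimization steps are omitted entirely, and the function names, the majority-vote overlap rule, and the parameter `k` are assumptions made only for this example.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), skipping `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote_like(minority, n_new, k=3, seed=0):
    """Plain SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority neighbors. Needs at least 2 minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbor = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def overlapping_majority(majority, minority, k=3):
    """Flag majority samples whose k nearest neighbors (over both classes)
    are mostly minority samples. This voting rule is an assumed stand-in
    for the objective function described in the abstract."""
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    flagged = []
    for i, m in enumerate(majority):
        others = labelled[:i] + labelled[i + 1:]  # drop the sample itself
        nearest = sorted(others, key=lambda item: math.dist(m, item[0]))[:k]
        if sum(label for _, label in nearest) > k / 2:
            flagged.append(m)
    return flagged


def random_oversample(minority, target_size, seed=0):
    """Random Oversampling (ROS): duplicate existing minority samples
    until the class reaches `target_size`; no new points are created."""
    rng = random.Random(seed)
    missing = max(0, target_size - len(minority))
    return list(minority) + [rng.choice(minority) for _ in range(missing)]
```

A hybrid pipeline in the spirit of the second contribution would first drop the samples returned by `overlapping_majority` from the majority class and then call `random_oversample` on the minority class with the reduced majority size as the target. Because ROS only duplicates existing points, every output sample is one of the originals, whereas `smote_like` produces new points on segments between minority samples.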