Research On Imbalanced Data Classification Methods For Unsafe Samples

Posted on:2022-12-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

GTID:2518306779469504

Subject:Automation Technology

Abstract/Summary:

With the advent of the big data era, the imbalanced classification problem has attracted growing attention in many application fields. Traditional classification algorithms usually assume that every class has the same misclassification cost and the same number of samples, so they perform poorly on imbalanced data in practice. Two common remedies are resampling the original data set and improving the traditional classification algorithms. The main difficulty of imbalanced classification is the low recognition rate of minority class samples, and unsafe samples are the hardest of these to classify. They are therefore also an important entry point for improving the classification performance of an algorithm. To improve the recognition of minority unsafe samples in imbalanced data, this thesis proposes two improvement strategies, one at the data level and one at the algorithm level.

At the data level, this thesis proposes a density-modified hybrid sampling algorithm based on the Borderline-SMOTE and Tomek Links algorithms. First, the KNN algorithm is used to find the k nearest neighbors of each target sample, and each type of sample is processed differently according to the information in its neighbors, so as to modify the local density. This step significantly increases the local density of the minority class samples, and the unsafe samples that have relatively few same-class samples among their nearest neighbors receive the largest density increase, which helps retain sample information in the subsequent resampling. Second, Borderline-SMOTE is used to oversample the density-modified sample space, so that the remaining minority class samples are generated at the decision boundary. Finally, Tomek Links sample pairs are identified and removed from the balanced sample space to clear the decision boundary between classes.

At the algorithm level, this thesis proposes an ensemble learning algorithm based on Random Forest, XGBoost and AdaBoost. First, the class attribute of each minority class sample is determined from its k-nearest-neighbor information. Without changing the imbalance ratio of the subsets, the data set is then split and reconstructed according to these class attributes to generate safe sets, boundary sets and rare-outlier sets, which increases the proportion of minority samples with the same attribute within each subset. Second, to exploit the identification strengths of the different algorithms, Random Forest, XGBoost and AdaBoost are used to build base classifiers on the safe sets, boundary sets and rare-outlier sets respectively, which improves the recognition performance of some base classifiers on the minority unsafe samples. Finally, the results of the classifiers are combined, following the idea of ensemble learning, to predict the target sample, which helps improve the generalization ability of the model and its overall classification performance.

To verify the effectiveness of the algorithms, comparative experiments are conducted on 24 artificial data sets and 12 real data sets from the UCI and KEEL databases. The experimental results show that the proposed hybrid sampling algorithm significantly improves classification performance, especially for the minority unsafe samples in the data set, and that the proposed ensemble learning algorithm significantly improves the recognition of minority unsafe samples without reducing the recognition rate of majority samples.
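
Two building blocks mentioned in the abstract, typing minority samples from their k nearest neighbors and identifying Tomek Links pairs, can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names, the default k = 5, and the thresholds that separate safe, borderline, rare and outlier samples are assumptions chosen for illustration, and the density-modification step is not reproduced because the abstract does not specify it.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix with an infinite diagonal,
    so that no sample is counted as its own neighbor."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    return dist


def minority_sample_types(X, y, minority_label=1, k=5):
    """Label each minority sample by the share of same-class samples
    among its k nearest neighbors.

    Assumed thresholds (not taken from the thesis):
    share >= 4/5 -> safe, >= 2/5 -> borderline, > 0 -> rare, 0 -> outlier.
    """
    dist = pairwise_distances(X)
    types = {}
    for i in np.flatnonzero(y == minority_label):
        neighbors = np.argsort(dist[i])[:k]
        same = int((y[neighbors] == minority_label).sum())
        # Integer comparisons avoid floating-point ratio checks.
        if 5 * same >= 4 * k:
            types[int(i)] = "safe"
        elif 5 * same >= 2 * k:
            types[int(i)] = "borderline"
        elif same > 0:
            types[int(i)] = "rare"
        else:
            types[int(i)] = "outlier"
    return types


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, of samples from different classes
    that are each other's nearest neighbor."""
    dist = pairwise_distances(X)
    nearest = dist.argmin(axis=1)
    return [
        (i, int(j))
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]
```

In this sketch, the samples labeled borderline, rare or outlier correspond to what the abstract calls unsafe samples. A cleaning step in the spirit of the abstract would drop the samples returned by `tomek_links` after oversampling; whether one or both members of each pair are removed is a design choice the abstract leaves open.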

Keywords/Search Tags:

Imbalanced Data, Unsafe Samples, Hybrid Sampling Algorithm, Ensemble Learning

Related items

1	An Adaptive Sampling Ensemble Classifier For Learning From Imbalanced Data Sets
2	Hybrid Ensemble Learning For Imbalanced Data
3	Research On Imbalanced Data Classification Based On Sampling Method And Ensemble Learning
4	Research On Ensemble Approach For Classification Of Imbalanced Data Sets
5	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
6	Imbalanced Data Classification Algorithm Based On Unsupervised Intelligent Under Sampling Method
7	Research On Imbalanced Data Classification Algorithms Based On Ensemble Learning
8	Research On Over Sampling Algorithm Oriented To Subdivision Of Minority Class Samples In Imbalanced Data Set
9	Research On Software Defect Prediction Based On Hybrid Sampling And Integrated Learning
10	Two-class Imbalanced Big Data Classification Based On Data Reduction And Ensemble Learning