
Unbalanced Data Classification Based On Resampling And Hybrid Ensemble

Posted on: 2022-09-04 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Liu | Full Text: PDF
GTID: 2518306311464084 | Subject: Statistics
Abstract/Summary:
The rapid development of information technology has produced large volumes of complex data, and how to extract useful information from it is worth studying. Classification in machine learning plays a vital role in this regard. Traditional classification algorithms assume that the classes contain similar numbers of samples and that the costs of misclassifying each class are comparable. However, class imbalance is common in real classification problems, and under imbalance these traditional algorithms are no longer suitable. The classification of imbalanced data is therefore a highly practical research topic.

Previous work has combined resampling methods with ensemble learning, producing algorithms such as SMOTEBoost, RUSBoost, and EUSBoost. However, undersampling-based methods discard useful information in the negative (majority) samples, while oversampling-based methods increase model complexity and are prone to overfitting. In response to these problems, some scholars have proposed hybrid ensemble methods that ensemble the ensembles: Boosting is performed on each training set in the inner layer, and Bagging is performed over the Boosting classifiers in the outer layer. The parallel integration scheme of EasyEnsemble is easier to implement than the serial scheme of BalanceCascade, which reduces time complexity and improves the efficiency of the algorithm. However, EasyEnsemble does not resample the minority set when it generates each new training set: every base classifier is trained on exactly the same minority samples, which can lead to over-learning of the minority class.

To address this over-learning problem, this thesis proposes an improved hybrid ensemble method, bEnsemble, which increases the diversity of the learners in the hybrid ensemble and reduces the variance of the model. The bEnsemble algorithm first uses the bootstrap method to sample the positive (minority) samples at a sampling rate r, and randomly undersamples the negative (majority) samples to obtain a subset of the same size. It then merges the two subsets into a balanced training set and uses XGBoost to train a base classifier. Finally, it repeats this procedure T times and combines all the base classifiers into the final ensemble classifier.

For highly imbalanced data sets, this thesis further proposes the sEnsemble algorithm, which combines bEnsemble with the SMOTE oversampling method. Applying SMOTE with an upsampling rate N before running bEnsemble increases the sample diversity of the training set and also increases the number of samples available to each base classifier in every round.

On the theoretical side, this thesis analyzes the time complexity, the bias-variance trade-off, and the error-divergence decomposition of the two proposed algorithms. Finally, it measures algorithm performance with three indicators, F1-measure, G-mean, and AUC, together with the Friedman test, and designs two experiments. The first is a comparative experiment along the two improvement directions, Bagging and Boosting, to verify the correctness and necessity of each direction. The second compares the new algorithms with other classic algorithms and gives suggestions for tuning the parameters of the new algorithms. The results show that bEnsemble and sEnsemble generally outperform the other algorithms on all three indicators, confirming the performance of the new algorithms across different data sets and evaluation measures.
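The bEnsemble procedure described above (bootstrap the minority class at rate r, undersample the majority class to the same size, train a base learner, repeat T times, and average) can be sketched as follows. This is an illustrative sketch, not the thesis's implementation: the function names and defaults are assumptions, and a shallow decision tree stands in for XGBoost so the example stays self-contained.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def b_ensemble_fit(X, y, T=10, r=0.8, seed=0):
    """Sketch of bEnsemble: T balanced subsets, one base classifier each.
    Assumes y == 1 marks the positive (minority) class."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    n_sub = int(r * len(pos))  # bootstrap sample size at rate r
    models = []
    for _ in range(T):
        p = rng.choice(pos, size=n_sub, replace=True)   # bootstrap positives
        n = rng.choice(neg, size=n_sub, replace=False)  # undersample negatives
        idx = np.concatenate([p, n])
        clf = DecisionTreeClassifier(max_depth=3)  # stand-in for XGBoost
        clf.fit(X[idx], y[idx])
        models.append(clf)
    return models

def b_ensemble_predict(models, X):
    # Combine the base classifiers by averaging positive-class probabilities.
    probs = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (probs >= 0.5).astype(int)
```

Because each round draws a fresh bootstrap of the minority class, no two base classifiers see exactly the same minority samples, which is the source of the extra diversity relative to EasyEnsemble.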
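The SMOTE step that sEnsemble applies before bEnsemble creates synthetic minority samples by interpolating between each minority point and one of its k nearest minority-class neighbours. A minimal pure-NumPy sketch, assuming Euclidean distance (the function name and defaults are hypothetical, not the thesis's code):

```python
import numpy as np

def smote(X_min, N=2, k=5, seed=0):
    """Minimal SMOTE sketch: for each minority sample, generate N synthetic
    points along the segment to a randomly chosen nearest neighbour."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    synth = []
    for i in range(n):
        for _ in range(N):
            j = nn[i, rng.integers(k)]   # pick one neighbour at random
            lam = rng.random()           # interpolation factor in [0, 1)
            synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synth)
```

With an upsampling rate N, the minority set grows by N synthetic samples per original sample; the enlarged set then feeds into the bootstrap step of bEnsemble.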
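Of the three evaluation indicators, G-mean is the one most specific to imbalanced classification: it is the geometric mean of sensitivity (minority-class recall) and specificity (majority-class recall), so a classifier that ignores the minority class scores zero. A small sketch, assuming label 1 marks the minority class:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """G-mean = sqrt(sensitivity * specificity) from the confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens = tp / (tp + fn) if tp + fn else 0.0  # minority recall
    spec = tn / (tn + fp) if tn + fp else 0.0  # majority recall
    return float(np.sqrt(sens * spec))
```

For example, a classifier that labels everything negative has sensitivity 0 and thus G-mean 0, even though its plain accuracy on a highly imbalanced set can be very high.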
Keywords/Search Tags:Classification of Imbalanced Data, Bootstrap, SMOTE Oversampling, XGBoost, Hybrid Ensemble