Classification In Imbalanced Data Based On Over-Sampling And Ensemble Learning

Posted on: 2018-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
Full Text: PDF
GTID: 2348330515460064
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Imbalanced data has become an increasingly popular research topic in statistical machine learning. Most current statistical machine learning theory and existing classification algorithms assume that the classes contain roughly equal numbers of samples, and carry out statistical inference and analysis on that basis. When these classical methods are applied to imbalanced data, they become seriously biased, so the recognition rate of the minority class is quite low. In practical applications, however, the minority class is usually the one of greater interest. Improving the recognition rate of the minority class is therefore of both theoretical and practical significance. This thesis improves the traditional classification algorithm in two respects.

1. At the data level, the BOS sampling method is proposed. It is based on the nonparametric Bootstrap sampling method. To construct each new sample, a small sub-sample set is drawn and its expected value is taken as the new sample. This enlarges the sample and reduces the imbalance between classes. Experiments show that the method improves on the classical SMOTE algorithm in the evaluation metrics. The samples constructed by BOS are especially effective when the number of samples to be added is small.

2. At the algorithm level, the Ort statistic and the Im-AdaBoost algorithm are proposed. We analyze the weight update process of AdaBoost and point out that it distinguishes only whether a sample is classified correctly, not whether the sample belongs to the positive or the negative class. We also analyze how the diversity of the classifiers affects the generalization ability of ensemble learning, and put forward an orthogonal diversity statistic. Building on these two analyses, the thesis presents the Im-AdaBoost algorithm for imbalanced data. AdaBoost is the special case of Im-AdaBoost with parameter s = 1. The upper bound on the generalization error of Im-AdaBoost is consistent with that of AdaBoost: it is the product of the normalization factors from each round's weight update. Experiments show that Im-AdaBoost improves the F1 and g metrics compared with the AdaBoost classification algorithm.
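The BOS idea summarized above (draw a small bootstrap sub-sample, use its mean as a new sample) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no pseudocode, so the function names `bos_oversample` and `f1_and_gmean`, the `subset_size` parameter, and the choice to draw only from minority-class rows are assumptions made for illustration. The second helper computes F1 and, assuming the "g" metric refers to the G-mean (the geometric mean of sensitivity and specificity), that value as well.

```python
import numpy as np


def bos_oversample(X_min, n_new, subset_size=3, seed=0):
    """Hypothetical BOS-style sampler: each synthetic point is the mean
    of a small bootstrap sub-sample of the minority class (assumption)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    # Bootstrap: draw row indices with replacement, one index row per new sample.
    idx = rng.integers(0, n, size=(n_new, subset_size))
    # Average each sub-sample; the mean estimates its expected value.
    return X_min[idx].mean(axis=1)


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1, G-mean) for a binary problem; `positive` marks the minority class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(f1), float(np.sqrt(recall * specificity))


if __name__ == "__main__":
    minority = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
    synthetic = bos_oversample(minority, n_new=5)
    print(synthetic.shape)
    print(f1_and_gmean([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because every synthetic point is an average of existing minority points, it always lies inside their convex hull. SMOTE instead interpolates between a point and one of its nearest neighbours, so the two methods place new samples differently even on the same minority set.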
Keywords/Search Tags: Bootstrap Resampling, Nonparametric Statistics, Ensemble Learning