Classification In Imbalanced Data Based On Over-Sampling And Ensemble Learning

Posted on:2018-05-04

Degree:Master

Type:Thesis

Country:China

Candidate:T Wang

Full Text:PDF

GTID:2348330515460064

Subject:Probability theory and mathematical statistics

Abstract/Summary:

Imbalanced data has become an increasingly popular research topic in statistical machine learning. Most current statistical machine learning theory and existing classification algorithms assume that the amount of sample data in each class is roughly equal, and carry out statistical inference and analysis on that basis. When these classical methods are applied to imbalanced data, however, they become seriously biased, and the recognition rate for the minority class is quite low. In real applications, people are usually more concerned with the information carried by the minority class. Improving the recognition rate of the minority class is therefore of both theoretical and practical significance. This paper improves the traditional classification algorithm in two respects.

1. At the data level, the BOS sampling method is proposed. The method is based on Bootstrap sampling from nonparametric statistics. To construct each new sample, a small sub-sample set is drawn and its expected value is taken as the new sample. This enlarges the sample size and reduces the imbalance between classes. Experiments show that the method improves on the classical SMOTE algorithm in the evaluation metrics. The samples constructed by the BOS algorithm are more effective, especially when the number of samples to be added is small.

2. At the algorithm level, the Ort statistic and the Im-AdaBoost algorithm are proposed. We analyze the weight update process of the AdaBoost algorithm and point out that it distinguishes only whether a classification is correct, not whether the sample belongs to the positive or the negative class. In addition, we analyze the influence of classifier diversity on the generalization ability of ensemble learning and put forward the orthogonal diversity statistic. Based on these two aspects, this paper gives the Im-AdaBoost algorithm for imbalanced data. AdaBoost is the special case of the Im-AdaBoost algorithm with parameter s = 1. The upper bound on the generalization error of this algorithm is consistent with that of the AdaBoost algorithm: it is the product of the normalization factors from each round's weight update. Experiments show that the improved algorithm achieves higher F1 and g metrics than the AdaBoost classification algorithm.
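
The BOS construction step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the full algorithm, so several details here are assumptions: the sub-samples are drawn from the minority class only, drawing is with replacement (as in the standard Bootstrap), the "expected value" is taken to be the arithmetic mean of the drawn rows, and the function name `bos_oversample` and its parameters `n_new` and `subset_size` are hypothetical.

```python
import numpy as np


def bos_oversample(X_min, n_new, subset_size=3, seed=None):
    """Sketch of a Bootstrap-style over-sampler (hypothetical helper).

    X_min       : array of shape (n_minority, n_features), minority-class rows
                  (assumption: only the minority class is resampled)
    n_new       : number of synthetic samples to construct
    subset_size : size of each small bootstrap sub-sample (assumed parameter)
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    # Bootstrap step: for every new sample, draw subset_size row indices
    # with replacement from the minority class.
    idx = rng.integers(0, n, size=(n_new, subset_size))
    # Construction step: the mean of each sub-sample becomes one new sample.
    return X_min[idx].mean(axis=1)


# Usage: enlarge a three-point minority class by five synthetic points.
X_minority = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
X_synthetic = bos_oversample(X_minority, n_new=5, subset_size=3, seed=0)
X_augmented = np.vstack([X_minority, X_synthetic])
```

Because each synthetic point is an average of existing minority rows, it always lies inside their convex hull. SMOTE, by contrast, interpolates along the segment between a sample and one of its nearest neighbours.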

Keywords/Search Tags:

Bootstrap Resampling, Nonparametric statistics, Ensemble Learning

Related items

1	Research And Application Of Ensemble Learning Based On Combined Resampling Methods
2	Unbalanced Data Classification Based On Resampling And Hybrid Ensemble
3	Bootstrap resampling in wavelet analysis and statistical methodologies in ecological research
4	Research On Ensemble Learning Approaches To Imbalanced Data Sets
5	Research On Imbalanced Data Classification Methods Based On Resampling And Ensemble Learning
6	Study Of Face Recognition Methods Based On Resampling Technology
7	Research On Classifier Ensemble
8	Step density estimation and bootstrap resampling
9	Comparative properties of nonparametric statistics for the analysis of the 2 x c layout for ordinal categorical dat
10	Resampling and distribution of the product methods for testing indirect effects in complex models