Research On Imbalanced Data Classification Algorithms Based On Ensemble Learning

Posted on: 2019-11-18  Degree: Master  Type: Thesis
Country: China  Candidate: R X Wang  Full Text: PDF
GTID: 2428330548958872  Subject: Communication and Information System
Abstract/Summary:
Classification is one of the important means of knowledge acquisition in data mining and machine learning, and classical classification algorithms are usually proposed under the assumption that the dataset is balanced. In practical applications, however, many datasets are imbalanced. Sometimes the minority class is even the more important one, and misclassifying it carries a higher cost, as in credit card fraud detection, medical diagnosis, and spam identification. Traditional classification algorithms based on overall classification accuracy are therefore not suitable for imbalanced data, and it is of great significance to study how to improve classifier performance on such data. Ensemble learning generally classifies well, and the data subsets used by its individual learners can be combined with the resampling techniques of imbalanced data classification. This thesis therefore mainly studies ensemble learning applied to the classification of imbalanced data. Several novel algorithms are proposed along three lines: balancing the sample distribution of a dataset by oversampling, combining mixed-sampling-based ensemble learning with an improved classification algorithm, and transforming the imbalanced data classification problem into an anomaly detection problem. The main work of this thesis is as follows:

(1) This thesis studies resampling techniques for balancing datasets. During oversampling, the synthetic minority oversampling technique (SMOTE) and borderline synthetic minority oversampling technique (BSMOTE) algorithms do not take into account the differences between minority class samples, and the number of samples to be synthesized is chosen randomly, which leads to some blindness. An adaptive borderline minority oversampling technique (ABSMOTE) is proposed, which considers the average neighbor distance and the number of neighboring majority class samples for the boundary samples of the minority class. Experiments on UCI datasets show that the ABSMOTE algorithm can improve classifier performance on imbalanced data.

(2) To increase the diversity of individual learners in ensemble learning and improve classification performance, this thesis first improves the weight updating process of the AdaBoost (Adaptive Boosting) algorithm and then proposes an improved AdaBoost ensemble based on mixed sampling with different sampling rates (IAE-MSD) algorithm. The oversampling part uses the ABSMOTE algorithm, and the undersampling part uses a prior-based layered undersampling algorithm. The overall resampling step reduces the negative influence of noisy data and retains the original distribution of the dataset. Besides using different sampling rates so that each data subset is approximately balanced, the algorithm takes into account the extreme cases of oversampling only and undersampling only, and the difference in the number of samples between adjacent data subsets is roughly the same. In addition, the algorithm uses the improved AdaBoost algorithm as its base classifier. Experiments on UCI datasets show that the IAE-MSD algorithm can improve classifier performance on imbalanced data.

(3) The minority class data are regarded as abnormal data, and the imbalanced data classification problem is transformed into an anomaly detection problem. The iForest (isolation Forest) algorithm has a low ability to detect local outliers, and the LOF (Local Outlier Factor) algorithm has a long detection time. An improved algorithm that addresses both problems, named iForest-WHT (isolation Forest Based on Waterfall Hybrid Technology), is proposed. Experiments on a synthetic dataset and real UCI datasets show that the algorithm can improve anomaly detection performance.
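
To make the oversampling idea in item (1) concrete, the following is a minimal sketch of borderline, SMOTE-style interpolation in plain Python. It is not the thesis's ABSMOTE algorithm, whose exact formulas are not given in this abstract. The function names, the default k, and the rule of weighting each boundary sample by its number of majority class neighbors are all assumptions made for illustration. The sketch also assumes at least two minority samples.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean), excluding point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def borderline_samples(minority, majority, k=5):
    """Find minority samples on the class boundary.

    Uses the usual borderline rule: a sample is a boundary sample when at
    least half, but not all, of its k nearest neighbors belong to the
    majority class. Returns (sample, majority_neighbor_count) pairs.
    """
    labeled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    result = []
    for p in minority:
        neighbors = sorted(
            (q for q in labeled if q[0] is not p),
            key=lambda q: math.dist(p, q[0]),
        )[:k]
        m = sum(1 for _, label in neighbors if label == 0)
        if k / 2 <= m < k:
            result.append((p, m))
    return result


def oversample(minority, majority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation.

    Assumption (hypothetical weighting, not taken from the thesis): boundary
    samples with more majority class neighbors are chosen as the base point
    more often, instead of uniformly at random.
    """
    rng = random.Random(seed)
    border = borderline_samples(minority, majority, k)
    if not border:
        # No boundary samples found: fall back to plain SMOTE over all minority samples.
        border = [(p, 1) for p in minority]
    points = [p for p, _ in border]
    weights = [m for _, m in border]
    synthetic = []
    for _ in range(n_new):
        base = rng.choices(points, weights=weights, k=1)[0]
        neighbor = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic
```

Each synthetic sample lies on the line segment between a boundary minority sample and one of its minority class neighbors, so new points stay inside the region already occupied by the minority class. An adaptive scheme such as the one described above would replace the simple weighting with a rule derived from neighbor distances and majority neighbor counts.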
Keywords/Search Tags: imbalanced data classification, resampling technique, ensemble learning, sampling rates, anomaly detection