
Research on Imbalanced Data Classification Methods Based on Oversampling and Ensemble Learning

Posted on: 2020-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: L J Li
GTID: 2428330590471706
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data classification is common in industry and daily life, for example in medical diagnosis, spam filtering, and credit card fraud detection. Solving it effectively enables early warning and prediction, so it has both research significance and practical value. Traditional classification models are mostly trained on balanced datasets and pursue overall classification accuracy; on imbalanced datasets, their performance is unsatisfactory. Imbalanced data consists of a majority class and a minority class. Existing solutions include undersampling and oversampling methods at the data level, and cost-sensitive and ensemble learning methods at the algorithm level. At present, oversampling methods tend to synthesize overlapping samples and to over-fit, while ensemble learning methods mostly adopt a single-layer ensemble and do not select an appropriate classification threshold according to the characteristics of the dataset.

To address these problems, this thesis first studies oversampling at the data level and proposes Weighted Oversampling based on Hierarchical Clustering (WOHC). The method first clusters the minority class and checks the sample composition of each resulting cluster, so that overlapping or noisy samples are not synthesized. It then determines the sampling size of each cluster from the density of that minority class cluster, and determines the sampling weight of each minority class sample from the distance between that sample and the majority class boundary. Finally, oversampling is completed within each synthesis region. Experiments on several real datasets, combining the sampling method with traditional classifiers, show that it effectively improves the classification performance of traditional classifiers on imbalanced data.

Building on this oversampling method and combining it with ensemble learning, a two-layer ensemble learning method is designed. Adaboost serves as the outer ensemble learning framework, with Random Forest as its base classifier. WOHC sampling is applied to the imbalanced dataset, and the sampled dataset is used to train the base classifiers. In each Adaboost training round, misclassified synthetic samples are deleted and a corresponding number of new synthetic samples is generated with WOHC. The optimal classification threshold of Adaboost is selected adaptively by the OTSU algorithm. Experiments on several real datasets show that the proposed classification algorithm outperforms other ensemble classification algorithms.
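The abstract states that the classification threshold is chosen adaptively with the OTSU algorithm but does not give the procedure. The following is a minimal sketch of how Otsu's method could be applied to classifier scores, assuming the ensemble outputs a minority-class score in [0, 1] for each sample. The function name `otsu_threshold`, the `bins` parameter, and the example scores are hypothetical and are not taken from the thesis; this is not the author's implementation.

```python
def otsu_threshold(scores, bins=64):
    """Pick the cut on [0, 1] scores that maximises between-class variance.

    Hypothetical helper: `scores` are assumed to be minority-class scores
    in [0, 1], e.g. the normalised output of a boosted ensemble.
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    # Histogram of the scores over equal-width bins on [0, 1].
    hist = [0] * bins
    for s in scores:
        hist[min(int(s * bins), bins - 1)] += 1

    total = len(scores)
    centers = [(i + 0.5) / bins for i in range(bins)]
    total_sum = sum(c * h for c, h in zip(centers, hist))

    best_t, best_var = 0.5, -1.0
    w0 = 0      # number of samples at or below the candidate cut
    sum0 = 0.0  # sum of bin centers (weighted by count) below the cut
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += centers[i] * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one side is empty, so no valid split here
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        between = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        # Strict comparison: on ties the lowest candidate cut is kept.
        if between > best_var:
            best_var = between
            best_t = (i + 1) / bins  # upper edge of bin i
    return best_t


if __name__ == "__main__":
    # Illustrative scores only: many low (majority) and few high (minority).
    demo_scores = [0.05, 0.1, 0.1, 0.15, 0.1, 0.05, 0.12, 0.08, 0.6, 0.7]
    t = otsu_threshold(demo_scores)
    predicted_minority = [s >= t for s in demo_scores]
    print(t, predicted_minority)
```

A sample would then be assigned to the minority class when its score is at or above the returned cut, instead of using a fixed 0.5. Because the cut follows the score distribution of the dataset, it can move toward the minority side when scores are skewed, which is the motivation the abstract gives for an adaptive threshold.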
Keywords/Search Tags:Imbalanced data, Oversampling, Hierarchical Clustering, Ensemble Learning, Adaboost