Research Of Imbalanced Data Classification Method Based On Oversampling And Ensemble Learning

Posted on:2020-02-02

Degree:Master

Type:Thesis

Country:China

Candidate:L J Li

GTID:2428330590471706

Subject:Computer Science and Technology

Abstract/Summary:

Imbalanced data classification arises throughout industry and daily life, for example in medical diagnosis, spam filtering, and credit card fraud detection. Solving it effectively enables early warning and prediction, so it has both research significance and practical value. Traditional classification models are mostly trained on balanced datasets and pursue overall classification accuracy; on imbalanced datasets, which consist of a majority class and a minority class, their performance is unsatisfactory. Existing solutions include undersampling and oversampling methods at the data level, and cost-sensitive and ensemble learning methods at the algorithm level. At present, oversampling methods tend to synthesize overlapping samples and to over-fit, while ensemble learning methods mostly adopt single-layer ensembles and do not select a classification threshold suited to the characteristics of the dataset.

To address these problems, this paper first studies oversampling at the data level and proposes a weighted oversampling method based on hierarchical clustering (WOHC). The method first clusters the minority class and checks the sample composition of each resulting cluster, so as to avoid synthesizing overlapping or noisy samples. It then determines the sampling size of each cluster from the density of that cluster, and determines the sampling weight of each minority class sample from its distance to the majority class boundary. Finally, oversampling is carried out within each synthesis region. Experiments that combine this sampling method with traditional classifiers on several real datasets show that it effectively improves their classification performance on imbalanced data.

Building on this oversampling method, the paper then designs a two-layer ensemble learning method. Adaboost serves as the outer ensemble learning framework, with Random Forest as its base classifier. The imbalanced dataset is first sampled with WOHC, and the sampled dataset is used to train the base classifiers. In each Adaboost training round, misclassified synthetic samples are deleted and a corresponding number of new synthetic samples is generated with WOHC. The optimal classification threshold of Adaboost is selected adaptively by the OTSU algorithm. Experiments on several real datasets show that the proposed algorithm outperforms other ensemble classification algorithms.
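The threshold-selection step can be illustrated with a small sketch. The abstract does not say how the OTSU algorithm is applied to the ensemble output, so the following assumes the simplest reading: the classifier produces a minority-class score in [0, 1] for each sample, and Otsu's method picks the cut that maximizes the between-class variance of those scores. The function names `otsu_threshold` and `classify`, the bin count, and the toy scores are all hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence


def otsu_threshold(scores: Sequence[float], bins: int = 64) -> float:
    """Return the cut on [0, 1] scores that maximizes between-class variance."""
    if not scores:
        raise ValueError("scores must be non-empty")

    # Histogram of the scores over [0, 1]; out-of-range values are clamped.
    hist = [0] * bins
    for s in scores:
        idx = max(0, min(int(s * bins), bins - 1))
        hist[idx] += 1

    total = len(scores)
    centers = [(i + 0.5) / bins for i in range(bins)]
    total_sum = sum(h * c for h, c in zip(hist, centers))

    best_var, best_cut = -1.0, 0.5
    count_low = 0      # number of scores below the candidate cut
    sum_low = 0.0      # sum of bin centers for those scores
    for i in range(bins - 1):
        count_low += hist[i]
        sum_low += hist[i] * centers[i]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one side is empty, so the variance is undefined
        mean_low = sum_low / count_low
        mean_high = (total_sum - sum_low) / count_high
        between = (count_low / total) * (count_high / total) \
            * (mean_low - mean_high) ** 2
        if between > best_var:  # ties keep the lowest cut
            best_var = between
            best_cut = (i + 1) / bins
    return best_cut


def classify(scores: Sequence[float], threshold: float) -> List[int]:
    """Label a sample as minority (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


if __name__ == "__main__":
    # Toy scores: 90 majority samples score low, 10 minority samples score
    # higher but still below the default cut of 0.5.
    toy_scores = [0.05] * 90 + [0.35] * 10
    cut = otsu_threshold(toy_scores)
    print("selected threshold:", cut)
    print("minority predictions:", sum(classify(toy_scores, cut)))
```

With a fixed cut of 0.5 every sample in the toy data would be labeled as majority; a data-dependent cut lets the separation in the scores decide where the boundary falls, which is the motivation the abstract gives for adaptive threshold selection.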

Keywords/Search Tags:

Imbalanced data, Oversampling, Hierarchical Clustering, Ensemble Learning, Adaboost

Related items

1	Research Of Imbalanced Data Ensemble Classification Algorithm Based On Oversampling
2	Research On Ensemble Classifying Algorithm Of Imbalanced Date Set Based On Oversampling
3	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
4	The Research Of Imbalanced Data Based On Oversampling Technique
5	Two-class Imbalanced Big Data Classification Based On Data Reduction And Ensemble Learning
6	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
7	Two-class Imbalanced Data Classification Based On Diverse Data Generation And Ensemble Learning
8	Research And Application Of Imbalanced Data Classification Based On Oversampling Algorithm
9	Research On Predictive Maintenance Model For Imbalanced Industrial Data
10	Research On Ensemble Learning Algorithm For Imbalanced Data