Research Of Imbalanced Data Ensemble Classification Algorithm Based On Oversampling

Posted on:2019-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:F F Zhang

Full Text:PDF

GTID:2428330545459668

Subject:Computer Science and Technology

Abstract/Summary:

With the explosion of Big Data, imbalanced datasets have become increasingly common in fields such as credit card fraud detection, bank bankruptcy prediction, and medical diagnosis, and the class imbalance in these datasets is often severe. Improving classification accuracy and classifier performance on such data is a top priority in data mining and machine learning. This thesis filters the original dataset through noise processing and proposes a new data-balancing method. The improved oversampling algorithm is then combined with AdaBoost, so that classification of imbalanced data is improved at both the data level and the algorithm level. The results show that the proposed method is feasible and effective. The main research contents of the thesis are as follows.

First, the thesis summarizes existing oversampling methods and proposes a new model, SDPD-SMOTE, based on sub-clusters and probability distribution. The method uses information from the majority samples to divide the minority samples into sub-clusters, and then uses a probability assigned to each sub-cluster to perform the oversampling. On the one hand, the method selects "seed samples" at random during oversampling, so that the synthesized samples are random and better simulate the distribution of the real data. On the other hand, it allocates oversampling weights to all minority sub-clusters, to avoid heavily over-covering some sub-clusters, and thereby balances the training information within the class. Experiments show that the improved oversampling method SDPD-SMOTE achieves better results.

Second, the thesis combines the improved oversampling method with AdaBoost and proposes the SDPDBoost classification model. The model combines the advantages of AdaBoost and oversampling: the improved sampling method synthesizes new samples to balance the data to some extent, and the synthesized samples are corrected promptly to ensure their quality after oversampling. AdaBoost, in turn, offers higher classification accuracy and better generalization ability. A decision tree is used as the base classifier. In each iteration, the oversampling method synthesizes samples so that the training information is balanced, and the final classification model is obtained. Comparison with other classification models shows that the proposed model achieves better accuracy and classification performance.
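
The abstract does not give the details of SDPD-SMOTE, so the following is only a minimal sketch of the general idea it describes: a budget of synthetic samples is divided among minority sub-clusters according to weights, and each synthetic sample is produced by SMOTE-style interpolation from a randomly chosen seed sample. The function names `allocate_counts` and `oversample`, the way sub-clusters are passed in already formed, and the proportional weighting rule are all assumptions made for illustration, not the thesis's actual algorithm.

```python
import random


def allocate_counts(weights, n_new):
    """Split n_new synthetic samples across sub-clusters in proportion to weights.

    Assumption: proportional allocation with largest-remainder rounding;
    the thesis may compute and apply its sub-cluster weights differently.
    """
    total = sum(weights)
    raw = [w / total * n_new for w in weights]
    counts = [int(r) for r in raw]  # floor of each share
    # Give the leftover samples to the largest fractional parts.
    order = sorted(range(len(weights)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: n_new - sum(counts)]:
        counts[i] += 1
    return counts


def oversample(sub_clusters, weights, n_new, rng=None):
    """Generate n_new synthetic minority samples, sub-cluster by sub-cluster.

    sub_clusters: list of sub-clusters, each a list of feature vectors
                  (hypothetical input; forming the sub-clusters is not shown).
    weights:      one non-negative weight per sub-cluster.
    """
    rng = rng or random.Random()
    synthetic = []
    for cluster, count in zip(sub_clusters, allocate_counts(weights, n_new)):
        for _ in range(count):
            seed = rng.choice(cluster)  # randomly selected seed sample
            mate = rng.choice(cluster)  # partner from the same sub-cluster
            gap = rng.random()          # interpolation factor in [0, 1)
            synthetic.append([s + gap * (m - s) for s, m in zip(seed, mate)])
    return synthetic


if __name__ == "__main__":
    clusters = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[10.0, 10.0], [12.0, 10.0], [11.0, 12.0]],
    ]
    for point in oversample(clusters, weights=[1, 3], n_new=8, rng=random.Random(0)):
        print(point)
```

Because interpolation only happens between samples of the same sub-cluster, every synthetic point stays inside that sub-cluster's bounding box, and the weights control how much of the oversampling budget each sub-cluster receives.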

Keywords/Search Tags:

Imbalanced data, noise processing, oversampling, decision tree, AdaBoost

Related items

1	Research Of Imbalanced Data Classification Method Based On Oversampling And Ensemble Learning
2	Research On Ensemble Classifying Algorithm Of Imbalanced Date Set Based On Oversampling
3	Numerical Analysis And Algorithm Improvement Of Imbalanced Data Based On Decision Tree
4	Research And Application Of Imbalanced Data Classification Based On Oversampling Algorithm
5	Research On Cover-based Algorithms For Oversampling On Imbalanced Data
6	Processing And Identification Methods Of Imbalanced Financial Transaction Data
7	Improved Methods Of Oversampling And Feature Selection Based On Imbalanced Data
8	Insurance Cross-selling Prediction Based On Imbalanced Data
9	Research Of Imbalance Data Over-sampling Technique Based On Three-way Decisions
10	Research On Imbalanced Datasets Classification Based On Machine Learning And Oversampling Methods