
Research Of Imbalanced Data Ensemble Classification Algorithm Based On Oversampling

Posted on: 2019-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: F F Zhang
GTID: 2428330545459668
Subject: Computer Science and Technology
Abstract/Summary:
With the explosion of Big Data, imbalanced datasets have become increasingly common in fields such as credit card fraud detection, bank bankruptcy prediction, and medical diagnosis. The classes in these datasets are severely imbalanced. Improving classification accuracy and classifier performance on such data is a top priority in data mining and machine learning. The thesis filters the original dataset through noise processing and proposes a new data-balancing method. The improved oversampling algorithm is then combined with AdaBoost, so that the classification of imbalanced data is improved at both the data level and the algorithm level. The results show that the proposed method is feasible and effective. The main research contents of the thesis are as follows.

First, the thesis summarizes existing oversampling methods and proposes a new model, SDPD-SMOTE, based on sub-clusters and probability distribution. The method uses information from the majority samples to divide the minority samples into sub-clusters, and then uses these sub-clusters to obtain the probability with which each sub-cluster is oversampled. On the one hand, the method selects “seed samples” and chooses them at random during oversampling, so that the synthesized samples are random and better simulate the distribution of the real data. On the other hand, it allocates oversampling weights to all minority sub-clusters, which avoids severe over-coverage of some sub-clusters and balances the training information within the class. Experiments show that the proposed oversampling method SDPD-SMOTE achieves better results.

Second, the thesis combines the improved oversampling method with AdaBoost and proposes the SDPDBoost classification model. The model combines the advantages of AdaBoost and oversampling: the improved sampling method synthesizes new samples to balance the data to some extent, and the synthesized samples are corrected promptly to ensure their quality after oversampling. The AdaBoost algorithm, in turn, offers higher classification accuracy and better generalization ability. A decision tree is used as the base classifier. In each iteration, the oversampling method synthesizes samples so that the training information is balanced, and the final classification model is obtained from these iterations. In comparisons with other classification models, the proposed model achieves better accuracy and classification performance.
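The sub-cluster oversampling idea described above can be illustrated with a minimal Python sketch. It is not the thesis's SDPD-SMOTE implementation: the abstract does not give the clustering rule or the weighting formula, so the sketch assumes the minority sub-clusters are already given (and non-empty), assumes a weight inversely proportional to sub-cluster size, and uses plain SMOTE-style linear interpolation between a randomly chosen seed sample and a second sample from the same sub-cluster. The names `allocate_counts` and `oversample` are hypothetical.

```python
import random


def allocate_counts(cluster_sizes, n_new):
    """Split n_new synthetic samples across minority sub-clusters.

    Assumption (not from the thesis): the weight of a sub-cluster is
    inversely proportional to its size, so sparse sub-clusters receive
    more synthetic samples and dense ones are not over-covered.
    """
    inv = [1.0 / size for size in cluster_sizes]
    total = sum(inv)
    counts = [int(n_new * w / total) for w in inv]
    # Hand out samples lost to rounding down, largest weight first.
    order = sorted(range(len(inv)), key=lambda i: inv[i], reverse=True)
    for i in range(n_new - sum(counts)):
        counts[order[i % len(order)]] += 1
    return counts


def oversample(clusters, n_new, rng=None):
    """Generate n_new synthetic minority samples, sub-cluster by sub-cluster.

    clusters: list of sub-clusters, each a non-empty list of feature vectors.
    """
    rng = rng or random.Random()
    counts = allocate_counts([len(c) for c in clusters], n_new)
    synthetic = []
    for cluster, k in zip(clusters, counts):
        for _ in range(k):
            seed = rng.choice(cluster)   # randomly selected seed sample
            mate = rng.choice(cluster)   # second sample from the same sub-cluster
            u = rng.random()             # interpolation factor in [0, 1)
            synthetic.append([a + u * (b - a) for a, b in zip(seed, mate)])
    return synthetic
```

Under these assumptions, a call such as `oversample(clusters, 6, random.Random(0))` on a two-sample and a four-sample sub-cluster assigns four synthetic samples to the smaller sub-cluster and two to the larger one, and every synthetic point stays inside the bounding box of its own sub-cluster.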
Keywords/Search Tags: Imbalanced data, noise processing, oversampling, decision tree, AdaBoost