Research On Traditional Classification Model Based On Unbalanced Data

Posted on:2020-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:H Zhao

Full Text:PDF

GTID:2428330578473083

Subject:Applied Statistics

Abstract/Summary:

With the rapid development of modern technology and the growth of data services, large volumes of data are accumulating and data types are becoming more diverse. Unbalanced data is a representative case: it now appears in medicine, finance, insurance, biology and other related fields, and it makes actual business data in these fields difficult to classify and forecast. Because traditional classifiers mostly assume a balanced data set, unbalanced data can cause minority-class samples to be neglected and classification performance to degrade. This thesis studies how to deal with unbalanced data at both the data level and the algorithm level, aiming to enhance the application value of traditional classification models by improving their classification performance on unbalanced data sets. The main contributions are summarized as follows:

(1) At the data level, a combined sampling method, SMOTE-EN+F, is proposed. The method adds the ensemble idea of Easy Ensemble to SMOTE-based over-sampling. It treats the traditional classification model as a sub-model and uses the F₁ value, which reflects classification performance on minority-class samples, as the weight; this reduces the sample imbalance and improves the classification performance of the traditional classification model. Experiments on UCI datasets show that the new method improves the classification performance of BP neural networks, Support Vector Machines (SVM) and Logistic classification models on unbalanced data.

(2) At the algorithm level, the Logistic classification model on unbalanced data is studied. The default threshold cannot reasonably separate the classes when the Logistic classification model is applied to unbalanced data, so this thesis proposes a method for determining the classification threshold, called the confidence threshold method. The method calculates the confidence of each sample in the unbalanced data and applies it to the default threshold of 0.5, so that the threshold carries sample information and the Logistic classification model can deal effectively with the classification of unbalanced data. The rationality of the confidence threshold method is also demonstrated on UCI datasets.

Finally, the proposed methods are applied to unbalanced credit data to study whether a customer becomes overdue, using the BP neural network, SVM and Logistic classifiers. The results indicate that, when the unbalanced credit data is processed with SMOTE-EN+F rather than with SMOTE-based over-sampling alone, the traditional classification models classify the credit data more accurately and identify overdue cases better, and an SVM credit scoring model based on SMOTE-EN+F is obtained. The proposed confidence threshold method also improves the applicability of the Logistic classifier on the actual unbalanced credit data set.
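
As a rough illustration of the data-level idea above, where sub-models are weighted by their F₁ value on the minority class, the following Python sketch combines the predictions of several sub-models by F₁-weighted voting. It is not the thesis's implementation: the abstract does not state how the weights are combined, so weighted majority voting is an assumption, as are the function names `f1_score` and `f1_weighted_vote`, the 0/1 label encoding (1 = minority class), and the toy predictions. The resampling step (SMOTE and Easy Ensemble) is omitted; each list of predictions stands in for a sub-model trained on one resampled subset.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive (minority) class; 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_weighted_vote(val_true, val_preds, test_preds):
    """Combine 0/1 sub-model predictions by F1-weighted voting (assumed scheme).

    val_preds and test_preds hold one list of predictions per sub-model.
    Weights are the minority-class F1 values measured on validation data.
    """
    weights = [f1_score(val_true, preds) for preds in val_preds]
    total = sum(weights)
    if total == 0:
        # No sub-model found a minority sample: fall back to equal weights.
        weights = [1.0] * len(val_preds)
        total = float(len(val_preds))
    combined = []
    for votes in zip(*test_preds):
        # Share of the total weight that votes for the minority class.
        score = sum(w for w, v in zip(weights, votes) if v == 1) / total
        combined.append(1 if score >= 0.5 else 0)
    return weights, combined


# Toy data (hypothetical): two minority samples among six validation samples.
val_true = [1, 1, 0, 0, 0, 0]
val_preds = [
    [1, 1, 0, 0, 0, 0],  # sub-model A
    [1, 0, 1, 0, 0, 0],  # sub-model B
    [0, 0, 0, 0, 0, 0],  # sub-model C never predicts the minority class
]
test_preds = [
    [1, 0, 1],  # sub-model A on three test samples
    [0, 0, 1],  # sub-model B
    [0, 1, 0],  # sub-model C
]
weights, combined = f1_weighted_vote(val_true, val_preds, test_preds)
print(weights, combined)
```

In this toy run, sub-model C has zero weight because it never identifies a minority sample, so its lone vote on the second test sample does not change the combined prediction. Weighting by accuracy instead would reward C for predicting the majority class everywhere, which is the failure mode the abstract attributes to traditional classifiers on unbalanced data.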

Keywords/Search Tags:

Unbalanced data, Oversampling, Under-sampling, BP-Neural Networks, Support Vector Machines, Logistic, Threshold

Related items

1	Research On Some Problems And Applications In Support Vector Machines
2	Research On Unbalanced Data Classification Based On Support Vector Mixed Sampling
3	Studies Of Several Mathematical Models And Algorithms Of Support Vector Machine
4	Some Research For Neural Networks And Support Vector Machines
5	The Research And Optimization On Support Vector Machines Algorithm
6	Classification Methods Based On Support Vector Machines And Manifold Learning
7	Research On Transient Stability Assessment Based On Neural Networks And Support Vector Machines
8	Dynamic learning with neural networks and support vector machines
9	An Improved Classification Algorithm Of SVM For Learning Unbalanced Datasets
10	Classification Research For Unbalanced Data Based On Hybrid-sampling