
Research On Traditional Classification Model Based On Unbalanced Data

Posted on: 2020-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhao
Full Text: PDF
GTID: 2428330578473083
Subject: Applied Statistics
Abstract/Summary:
With the rapid development of modern technology and the growth of data services, large amounts of data are accumulating and data types are becoming more diverse. Unbalanced data is a representative case: it now appears in medicine, finance, insurance, biology and other fields, and it makes classifying and forecasting real business data in these fields difficult. Because traditional classifiers mostly assume a balanced data set, on unbalanced data they may neglect minority samples, and their classification performance degrades. This paper studies how to handle unbalanced data at both the data level and the algorithm level, aiming to increase the practical value of traditional classification models by improving their classification performance on unbalanced data sets. The main contributions are summarized as follows:

(1) At the data level, a combined sampling method, SMOTE-EN+F, is proposed. The method adds the ensemble idea of Easy Ensemble to SMOTE-based over-sampling: the traditional classification model is treated as a submodel, and the F1 value, which reflects classification performance on minority-class samples, is used as its weight. This reduces the sample imbalance and improves the classification performance of the traditional model. Experiments on UCI datasets show that the new method improves the performance of BP neural networks, Support Vector Machines (SVM) and Logistic classification models on unbalanced data.

(2) At the algorithm level, the Logistic classification model on unbalanced data is studied. With unbalanced data, the default threshold cannot reasonably separate the classes, so this paper proposes a method for determining the classification threshold, called the confidence threshold method. The method calculates the confidence of each sample in the unbalanced data and applies it to the default threshold of 0.5, so that the threshold carries sample information and the Logistic classification model can deal effectively with unbalanced classification problems. The rationality of the confidence threshold method is also demonstrated on a UCI dataset.

Finally, the proposed methods are applied to predicting whether customers become overdue in unbalanced credit data, using the BP neural network, SVM and Logistic classifiers. The results show that, compared with SMOTE-based over-sampling, processing the unbalanced credit data with SMOTE-EN+F lets the traditional classification models classify the credit data more accurately and identify overdue records better, and an SVM credit scoring model based on SMOTE-EN+F is obtained. The proposed confidence threshold method also improves the applicability of the Logistic classifier on the real unbalanced credit data set.
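
The abstract does not give the details of the data-level method, so the following is only a minimal sketch of its two ingredients as described: SMOTE-style interpolation to create synthetic minority samples, and voting among submodels weighted by their F1 value on the minority class. All function names (smote_like, f1_score, weighted_vote) and parameters are hypothetical, chosen for illustration; this is not the thesis's implementation.

```python
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Assumes `minority` holds at least two points (lists of floats)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Nearest neighbours by squared Euclidean distance, excluding x itself.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """F1 value for the class labelled `positive` (here: the minority class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_vote(member_preds, weights, positive=1, negative=0):
    """Combine the predictions of several submodels, sample by sample.
    A sample is labelled positive when the submodels voting positive hold
    at least half of the total weight."""
    if sum(weights) == 0:
        weights = [1.0] * len(weights)  # fall back to an unweighted vote
    total = sum(weights)
    combined = []
    for votes in zip(*member_preds):
        score = sum(w for v, w in zip(votes, weights) if v == positive)
        combined.append(positive if 2 * score >= total else negative)
    return combined
```

In a full pipeline, each submodel would be trained on its own rebalanced training set, its F1 value would be measured on held-out data with f1_score, and those values would be passed as weights to weighted_vote. How the thesis builds the subsets and normalizes the weights is not stated in the abstract.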
Keywords/Search Tags:Unbalanced data, Oversampling, Under-sampling, BP-Neural Networks, Support Vector Machines, Logistic, Threshold