An Empirical Study On Data Sampling Of Unbalanced Classification

Posted on: 2021-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: K Dai
Full Text: PDF
GTID: 2427330605457331
Subject: Applied Statistics
Abstract/Summary:
Achieving the most accurate classification possible is a constant goal of machine learning research. Most machine learning classification models, however, are designed for balanced data and pursue only overall classification performance. In data mining practice, unbalanced data are common, and training directly on an unbalanced data set lowers predictive accuracy for the minority class. When the imbalance is extreme, the model tends to assign most minority-class samples to the majority class, yet the class with fewer samples is often precisely the object of interest, and such misclassifications can have unpredictable consequences. To address data imbalance, this thesis describes the commonly used data sampling methods, examines the respective advantages and disadvantages of undersampling, oversampling, and hybrid sampling, and analyzes them at the data level.

The empirical study models P2P lending data, in which the ratio of positive to negative samples is close to 1:11, with machine learning classifiers. First, basic loan information, user profiles, and business attributes are visualized and analyzed to identify the features most strongly correlated with default, so that investors can be advised to invest cautiously and losses caused by default risk can be reduced. Next, the data are preprocessed: features with a large proportion of missing values, or with no practical significance to the overall model, are deleted, and correlation analysis is used to select the features most related to the target variable, raising the upper limit of model performance. Finally, classification models are selected and the results without resampling are compared against those with data sampling, analyzing the influence of undersampling, oversampling, and hybrid sampling on the model
evaluation metrics: whether the ability to detect positive samples improves, and whether the overall classification performance of the model improves. The results show that data sampling can improve the comprehensive classification performance of the model, and that oversampling is generally better than the other two approaches. Undersampling discards a large number of majority-class samples, losing much of the sample information and reducing the model's classification accuracy. Oversampling does carry a risk of overfitting the minority class when generating new minority samples, but most of the oversampling methods in the experiments perform well, generally outperforming both undersampling and hybrid sampling. Therefore, when balancing the training set by resampling, oversampling should be preferred. Finally, comparing a single logistic regression classifier with an ensemble random forest classifier shows that the ensemble model outperforms the single weak classifier, so oversampling and random forest are combined to build the final classification model.
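The workflow described above can be sketched in a few lines. This is a minimal illustration, not a reproduction of the thesis's experiments: the `random_oversample` helper, the synthetic data, and the added class signal are all assumptions introduced here (the actual P2P lending data and the specific oversampling variants studied, such as SMOTE, are not reproduced). It shows the key steps: oversample the minority class in the training split only, keep the natural imbalance in the test split, and compare a single logistic regression against an ensemble random forest on minority-class F1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def random_oversample(X, y, rng):
    """Duplicate minority-class (y == 1) rows at random until both
    classes have the same number of samples."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority),
                       replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]


rng = np.random.default_rng(0)

# Synthetic stand-in for the ~1:11 lending data: defaults (y = 1)
# are roughly one twelfth of the rows and get a small mean shift
# so the classifiers have signal to learn.
n, d = 2400, 8
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / 12).astype(int)
X[y == 1] += 0.8

# Resample the training split only; the test split keeps its
# natural imbalance so evaluation reflects deployment conditions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)
X_bal, y_bal = random_oversample(X_tr, y_tr, rng)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(random_state=0))]:
    clf.fit(X_bal, y_bal)
    print(f"{name}: minority-class F1 = "
          f"{f1_score(y_te, clf.predict(X_te)):.3f}")
```

Oversampling before the split would leak duplicated minority rows into the test set and inflate the scores, which is why the resampling is confined to the training data.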
Keywords/Search Tags: Unbalanced classification, machine learning, data sampling, oversampling, ensemble classification