
Research On Classification Based On Unbalanced Data Sets

Posted on: 2021-02-03    Degree: Master    Type: Thesis
Country: China    Candidate: D X Zhang    Full Text: PDF
GTID: 2427330623465685    Subject: Applied statistics
Abstract/Summary:
The rapid development of information technology and the popularity of Internet platforms have allowed the Internet to be integrated ever more deeply with traditional industries, and historical data can now be used to serve all walks of life. In many real-world data sets, however, the classes are imbalanced: majority-class samples heavily outnumber minority-class samples, and the minority class is usually the focus of research. In the medical field, for example, cancer patients account for only a small share of the total sample, and ignoring or misclassifying these minority samples causes losses and negative impacts on individuals, families, and society as a whole that are far greater than those caused by misclassifying the majority class. Traditional classifier learning treats overall classification accuracy as the most important evaluation index, but on imbalanced data this criterion biases the learner toward the majority class: overall accuracy improves while the recognition rate of the minority class, which is usually the class of interest, falls. Such evaluation indicators are therefore unreasonable for prediction and classification on imbalanced data.

This thesis uses real hospital patient data from Ohio as the original data set, comprising 110,466 samples and 14 original feature fields. Because the data are raw, missing and abnormal values are first checked and handled; for example, outliers with an age less than 0 are processed, and a descriptive analysis is then carried out. A preliminary analysis of the feature fields lays the groundwork for feature derivation, after which the 14 original fields are derived into 39 feature fields. Finally, the random forest and CatBoost algorithms are used to rank feature importance, and 14 key feature fields are retained.

For this imbalanced data set, the thesis makes improvements and innovations in three aspects: data sampling, selection of the classification algorithm, and evaluation indicators. For data sampling, a new AK-SMOTE method is proposed that combines SMOTE oversampling with AllKNN undersampling. AK-SMOTE overcomes the loss of sample information caused by undersampling alone while also avoiding the drawbacks of oversampling alone; it handles imbalanced data better than a conventional single undersampling or oversampling method and greatly improves the recognition rate of the minority class.

For the classification algorithm, a new LRC classification algorithm is proposed. It uses logistic regression, random forest, and CatBoost as base models, and the predicted values output by these three base models serve as new feature fields for a secondary logistic regression learner, which outputs the final classification prediction. The results show that the LRC classification algorithm performs better than the other models.
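The abstract does not spell out the feature-importance ranking step; the following Python sketch shows one plausible way to combine random forest and CatBoost importances, assuming a pandas DataFrame `X` holding the 39 derived features and a binary target `y` (both names, and all parameter values, are hypothetical).

```python
# A minimal sketch of feature-importance ranking, assuming a pandas DataFrame X
# of derived features and a binary target y (hypothetical names and parameters).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

def rank_features(X: pd.DataFrame, y, top_k: int = 14) -> pd.DataFrame:
    """Average random-forest and CatBoost importances and keep the top_k features."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    cb = CatBoostClassifier(iterations=200, verbose=0, random_state=0).fit(X, y)

    scores = pd.DataFrame({
        "feature": X.columns,
        "rf_importance": rf.feature_importances_,
        "cb_importance": cb.get_feature_importance(),
    })
    # Normalise each importance vector so the two models contribute equally.
    for col in ("rf_importance", "cb_importance"):
        scores[col] = scores[col] / scores[col].sum()
    scores["mean_importance"] = scores[["rf_importance", "cb_importance"]].mean(axis=1)
    return scores.sort_values("mean_importance", ascending=False).head(top_k)
```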
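The exact AK-SMOTE implementation is not given in the abstract; as a rough approximation of the described combination, SMOTE oversampling can be chained with AllKNN undersampling using imbalanced-learn. The parameter values below are illustrative only.

```python
# A rough approximation of the AK-SMOTE idea: SMOTE oversampling followed by
# AllKNN undersampling, built from imbalanced-learn components.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import AllKNN

def ak_smote(X, y, random_state: int = 0):
    """Oversample the minority class with SMOTE, then clean the result with AllKNN."""
    X_over, y_over = SMOTE(random_state=random_state).fit_resample(X, y)
    X_res, y_res = AllKNN().fit_resample(X_over, y_over)
    return X_res, y_res
```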
For the evaluation indicators, the thesis adopts the f1-metric and the log-loss. The f1-metric is a weighted harmonic mean that objectively balances the recall and precision of the minority class, while the log-loss more accurately evaluates how well the model fits the data, with smaller values indicating a better fit. Compared with the evaluation indicators used by previous classifier models, taking the f1-metric and the log-loss as the final evaluation indicators is more reasonable when dealing with imbalanced data. In summary, the new AK-SMOTE method proposed in this thesis resamples the imbalanced data set, the new LRC classification algorithm is adopted for prediction, and the f1-metric and log-loss are used as the final evaluation indicators for handling real-life imbalanced-data problems, which has a certain practical reference significance.
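For the LRC idea and the f1-metric / log-loss evaluation described above, a minimal stacking sketch with scikit-learn and CatBoost might look as follows; the data splits `X_train`, `y_train`, `X_test`, `y_test` and all hyperparameters are hypothetical.

```python
# A minimal sketch of an LRC-style stacking classifier and its evaluation with
# the f1-metric and log-loss; names and parameter values are illustrative only.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss
from catboost import CatBoostClassifier

def build_lrc():
    """Stack LR, random forest and CatBoost under a logistic-regression meta-learner."""
    base_models = [
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("cb", CatBoostClassifier(iterations=200, verbose=0, random_state=0)),
    ]
    # The meta-learner is fed the base models' cross-validated predicted values.
    return StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(max_iter=1000))

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the stacker and report the f1-metric and log-loss on held-out data."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    return {
        "f1": f1_score(y_test, model.predict(X_test)),  # balances precision and recall
        "log_loss": log_loss(y_test, proba),            # smaller is better
    }
```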
Keywords/Search Tags: unbalanced data, AK-SMOTE sampling, LRC classification algorithm, f1-metrics, Log-loss