
The Sensitivity of Logistic Regression to Unbalanced Data: Measurement, Correction and Comparison

Posted on: 2017-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Lv
GTID: 2347330512974669
Subject: Statistics
Abstract/Summary:
In recent years, classification of unbalanced data sets has become a hot topic in machine learning and data mining. An unbalanced data set is one in which one class has far fewer samples than the other class or classes. The class with only a few samples is called the rare class, and the class with more samples is called the majority class. Traditional machine learning algorithms perform poorly on the rare class because of the imbalance in the class distribution. In practice, the cost of missing or misclassifying a rare-class sample is usually much higher than the cost of other errors, so the accuracy of rare-class classification receives more attention.

New methods for classifying unbalanced data sets have emerged to address this problem. Currently, there are two main strategies. The first improves the traditional classification algorithms themselves, so that the improved algorithms consider not only overall performance but also the classification accuracy on the rare class. Examples include cost-sensitive learning, ensemble learning, one-class learning, feature selection and training-set partitioning. The second strategy reconstructs the original data set, using different sampling techniques to balance it. Random sampling, one-sided selection (OSS) and the Synthetic Minority Oversampling Technique (SMOTE) belong to this second strategy. In addition, selecting proper performance measures is of great importance. A measure should reflect not only the accuracy on a particular class but also overall classification performance; examples include the AUC value, the geometric mean (G-mean), the F-measure and the ROC curve. The ROC curve is the most intuitive of these, because it displays both types of error for all possible thresholds.

With the development of machine learning and data mining, the number of classification algorithms has grown and classification techniques have become increasingly sophisticated. Examples include discriminant analysis, the logistic model, the KNN algorithm, decision trees and support vector machines. These algorithms have been applied in a wide range of areas with good classification results. Using data sets from the UCI database, this thesis analyzes the sensitivity of the logistic model, which has strong interpretability and stability, to data sets with different degrees of imbalance. The research shows that:
(1) The logistic model is affected by imbalance in the class distribution: the higher the degree of imbalance, the poorer the model's ability to identify the rare class.
(2) Compared with the other correction methods, namely random oversampling (ROS), random undersampling (RUS) and SMOTE, the improvement from OSS is neither significant nor stable. Simple sampling performs better than complex sampling.
(3) Five-fold cross-validation was used to measure the performance of the classification algorithms. It shows that, compared with Acc+ and G-mean, the AUC is not suitable for model selection on unbalanced data, because it can neither distinguish the four correction methods effectively nor reveal the difference before and after correction.
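
The measures and the two simple sampling corrections named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's own code. It assumes the usual definitions from the imbalanced-learning literature: Acc+ is the fraction of rare-class samples classified correctly, Acc- is the same for the majority class, G-mean is the square root of their product, and the F-measure weights precision and recall equally. The function names and the toy data are hypothetical, and both sampling helpers assume each class has at least one sample.

```python
import math
import random


def imbalance_metrics(y_true, y_pred, positive=1):
    """Accuracy, Acc+, Acc-, G-mean and F-measure for a binary problem.

    `positive` marks the rare class (an assumption of this sketch).
    """
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    acc_pos = tp / (tp + fn) if tp + fn else 0.0    # rare-class accuracy
    acc_neg = tn / (tn + fp) if tn + fp else 0.0    # majority-class accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + acc_pos
    return {
        "accuracy": (tp + tn) / len(y_true),
        "acc_pos": acc_pos,
        "acc_neg": acc_neg,
        "g_mean": math.sqrt(acc_pos * acc_neg),
        "f_measure": 2 * precision * acc_pos / denom if denom else 0.0,
    }


def _split_indices(y):
    """Return (minority_indices, majority_indices) for binary labels."""
    groups = {}
    for i, label in enumerate(y):
        groups.setdefault(label, []).append(i)
    minority, majority = sorted(groups.values(), key=len)
    return minority, majority


def random_oversample(X, y, seed=0):
    """ROS: duplicate random minority samples until the classes are balanced."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, seed=0):
    """RUS: keep a random majority subset the same size as the minority class."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y)
    idx = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: 5 rare-class samples among 100, and a classifier that
# always predicts the majority class.
y_toy = [1] * 5 + [0] * 95
X_toy = list(range(100))
print(imbalance_metrics(y_toy, [0] * 100))
```

On the toy data, the always-majority classifier reaches 95% overall accuracy while Acc+ and G-mean are both zero, which is why overall accuracy alone is a poor guide on unbalanced data. In a cross-validation setting, resampling would normally be applied to the training folds only, so that the test fold keeps the original class distribution.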
Keywords/Search Tags: Logistic Model, Unbalanced Data, ROC Curve, AUC, Balanced 5-CV