Class imbalance in a dataset can lead machine learning algorithms to predict minority-class samples with low accuracy, which degrades the overall performance of the classifier. To overcome this problem, several oversampling techniques for the training set have been proposed recently, including the Mahalanobis Distance-based Over-sampling (MDO) algorithm, which is applicable to multi-class classification tasks. To our knowledge, however, class imbalance in label distribution datasets has not been systematically studied. In this paper, a modified MDO algorithm, called the Modified Mahalanobis Distance-based Over-sampling (MMDO) algorithm, is proposed to address the shortcomings of the MDO algorithm. In addition, an oversampling algorithm for label distribution is designed to address class imbalance in label distribution datasets. Numerical experiments verify the effectiveness of the two proposed oversampling algorithms.

First, to introduce how oversampling is applied in the classifier modeling process and the related background, this paper systematically presents the modeling process of a machine learning model, the basic theory of label distribution learning, and the ambiguous diagnosis of lumbar disc degeneration, and presents the modeling ideas and optimization methods of the J48, JRIP and SVM classifiers. The prediction process of the AA-KNN classifier for label distribution is also described.

Second, a rigorous mathematical description of the MDO algorithm is given, on the basis of which several defects of the algorithm are identified: the synthetic instances it generates may be unevenly distributed; each new synthetic instance is not used to update the candidate instance set; and the hyperelliptic equation may have no solution in the real numbers. To resolve these problems one by one, the improved MMDO algorithm is proposed. Three kinds of classifiers (J48, JRIP and SVM) are modeled on six common datasets before and after oversampling. Numerical experiments show that, under the evaluation metrics R, Average Recall, G-mean and MAUC, the MMDO algorithm yields better classification performance than both the case without oversampling and the MDO algorithm.

Finally, to study class imbalance in label distribution learning, the concept of class imbalance for label distribution datasets is established according to the skewness of a label distribution sample. An oversampling algorithm for label distribution is then proposed. Numerical experiments with the AA-KNN classifier on the lumbar disc degeneration sample set verify that the new oversampling algorithm improves the performance of the classifier, confirming its effectiveness.
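
As a concrete illustration of two quantities named in this abstract, the following minimal Python sketch computes the Mahalanobis distance of each sample from its class mean, together with Average Recall and G-mean for a multi-class prediction. It is not the MDO or MMDO algorithm itself. The function names are hypothetical, the use of a pseudo-inverse for the covariance matrix is an assumption made for numerical safety, and G-mean is taken here to be the geometric mean of the per-class recalls, a common multi-class definition that the abstract does not spell out.

```python
import numpy as np


def mahalanobis_to_class_mean(X):
    """Mahalanobis distance of each row of X from the mean of X.

    X holds the samples of a single class, one sample per row.
    Assumption: a pseudo-inverse is used so that a singular covariance
    matrix does not raise an error.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov_inv = np.linalg.pinv(cov)
    # Quadratic form (x - mu)^T * cov_inv * (x - mu), evaluated per row.
    sq = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(sq, 0.0))


def per_class_recall(y_true, y_pred):
    """Recall of every class that occurs in y_true, in sorted class order."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    return np.array([np.mean(y_pred[y_true == c] == c) for c in classes])


def average_recall(y_true, y_pred):
    """Arithmetic mean of the per-class recalls."""
    return float(per_class_recall(y_true, y_pred).mean())


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls (assumed definition)."""
    recalls = per_class_recall(y_true, y_pred)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

Both metrics weight every class equally regardless of its size, which is why they are more informative than plain accuracy on imbalanced data: a classifier that ignores a minority class receives a recall of zero for that class, and its G-mean collapses to zero.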