Objectives: This thesis studied the class imbalance problem in predicting diabetes diagnosis and blood glucose control in the middle-aged and elderly population, and used resampling algorithms to improve the predictive performance of classification models. The aim was to provide theoretical and data support for clinicians carrying out diabetes diagnosis and blood glucose control in this population.

Methods: Based on cohort data from the China Health and Retirement Longitudinal Study (CHARLS), 5261 and 155 cases were selected according to the inclusion and exclusion criteria for the diabetes diagnosis and glycemic control analyses, respectively. Data on socio-demographics, lifestyle, physical examination, and blood tests were collected; missing values of continuous and categorical independent variables were imputed with the mean and a clustering algorithm, respectively. The chi-square test, t test, and rank sum test were used for univariate screening of factors that may affect diabetes diagnosis and blood glucose control, and LASSO-logistic regression (least absolute shrinkage and selection operator for logistic regression) was used for multivariable screening. The statistically significant variables from LASSO-logistic regression were used as predictors, with diabetes onset and glycemic control status as the respective outcome variables. Random undersampling (RUS), oversampling (SMOTE, ADASYN), and hybrid sampling (SMOTEENN, SMOTETomek) were used to balance the training set, and three classification models, logistic regression, support vector machine (SVM), and random forest (RF), were used to predict diabetes onset and glycemic control. On the training set, the optimal parameters were determined using stratified 5-fold cross-validation with AUC as the criterion. To analyze the influence of the resampling algorithms on the performance of the classification models, the evaluation metrics of accuracy, sensitivity, specificity, precision, G-means, F-measure (F1
score), and AUC were used to compare the performance of the classification models on the original and resampled data.

Results: 1. Risk prediction of diabetes onset in middle-aged and elderly people: (1) There were 11 possible influencing factors. The risk factors were smoking, alcohol consumption (more than once a month), and high levels of systolic blood pressure (mmHg), BMI (kg/m²), TG (mg/dl), glucose (mg/dl), uric acid (mg/dl), C-reactive protein (mg/L), and glycated hemoglobin (%); the protective factors were adequate sleep (h) and high levels of HDL-C (mg/dl). (2) On the imbalanced diabetes dataset, the accuracies of the logistic, SVM, and RF classification models were 95.50%, 96.33%, and 96.33%; the sensitivities were 5.17%, 0, and 0; the specificities were 98.95%, 100%, and 100%; the G-means were 0.2262, 0, and 0; and the AUCs were 0.7235, 0.7196, and 0.6990, respectively. (3) The resampling algorithms improved the sensitivity, G-means, and F1 scores of the logistic, SVM, and RF classification models in most cases. SMOTE, SMOTEENN, and SMOTETomek improved the AUC of the three classification models to varying degrees (P < 0.05). Compared with the logistic, SVM, and RF models trained on the imbalanced data, SMOTE at every sampling rate improved the AUC of the logistic and SVM models; SMOTEENN increased the AUC of the logistic and SVM models by 1.32% and 2.63%, respectively; and SMOTETomek increased the AUC of the RF model by 4.94%. RUS and ADASYN did not significantly improve the AUC of the classification models.

2. Prediction of glycemic control in middle-aged and elderly diabetic patients: (1) There were 9 possible influencing factors. The risk factors were advanced age, disease course ≥ 2 years, hypertension, overweight and obesity, elevated TG, and reduced HDL-C; the protective factors were urban residence, exercise, and receiving a physician's advice. (2) On the imbalanced glycemic control dataset, the accuracies of the logistic, SVM, and RF classification models were 83.67%, 83.67%, and 73.67%; the sensitivities were 12.50%, 0, and 0; the specificities were
97.56%, 100%, and 100%; the G-means were 0.3493, 0, and 0; and the AUCs were 0.7226, 0.7012, and 0.6662, respectively. (3) Several resampling algorithms improved the sensitivity, G-means, and F1 scores of the logistic, SVM, and RF classification models. ADASYN, SMOTEENN, and SMOTETomek improved the AUC of the three classification models to varying degrees (P < 0.05). Compared with the logistic, SVM, and RF models trained on the imbalanced data, ADASYN increased the AUC of the logistic model by 2.13%, SMOTEENN increased the AUC of the logistic model by 3.05%, and SMOTETomek increased the AUC of the RF model by 2.13%. RUS and SMOTE did not significantly increase the AUC of the classification models.

Conclusions: 1. Class imbalance in the diabetes onset data had an important impact on the classification models: the three classifiers built on the original data could not adequately identify diabetic patients. The SMOTE, SMOTEENN, and SMOTETomek algorithms handled the imbalanced diabetes data better and improved the predictive performance of the diabetes classification models.

2. Class imbalance in the glycemic control data of diabetic patients had an important impact on the classification models: the three classifiers built on the original data could not adequately identify the population with poorly controlled blood glucose. The ADASYN, SMOTEENN, and SMOTETomek algorithms handled the imbalanced glycemic control data better and improved the predictive performance of the glycemic control classification models.
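
The relationship between the reported sensitivity, specificity, and G-means can be illustrated with a minimal sketch. It assumes the commonly used definition, G-mean = sqrt(sensitivity × specificity); the abstract does not state the formula, and the function and variable names below are illustrative only. The inputs are the baseline (no resampling) values quoted in the Results.

```python
import math


def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity * specificity)


# Baseline values quoted in the Results, written as fractions.
diabetes_logistic = g_mean(0.0517, 0.9895)   # logistic model, diabetes onset
glycemic_logistic = g_mean(0.1250, 0.9756)   # logistic model, glycemic control
svm_or_rf_baseline = g_mean(0.0, 1.0)        # SVM / RF: sensitivity 0, specificity 100%

print(f"diabetes, logistic:  {diabetes_logistic:.4f}")
print(f"glycemic, logistic:  {glycemic_logistic:.4f}")
print(f"SVM/RF baseline:     {svm_or_rf_baseline:.4f}")
```

Under this assumption the first value agrees with the reported 0.2262 and the second agrees with the reported 0.3493 to about three decimal places. The SVM and RF baselines show why accuracy alone is misleading here: a classifier that never predicts the minority class keeps a high accuracy while its G-mean falls to 0.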