Objective: To illustrate the performance of the Stacking model in building a prospective risk assessment model for type 2 diabetes in middle-aged and elderly people in China; to explain the decision process of complex machine learning algorithms using multiple explainable analysis methods; and to provide a theoretical basis and technical support for applying the Stacking method and explainable analysis to the prevention and control of type 2 diabetes in middle-aged and elderly people.

Methods: Data were drawn from the 2011 and 2015 China Health and Retirement Longitudinal Study, from which 8 063 middle-aged and elderly people were selected as subjects. Socio-demographic characteristics, body measurements, biochemical indicators and other factors measured by questionnaire, physical examination and laboratory examination were collected as predictors. Logistic regression was used to explore the association between the predictors and type 2 diabetes in middle-aged and elderly people. Two predictor combinations were included in the Logistic regression analysis: predictor combination I (continuous variables: age, systolic blood pressure, diastolic blood pressure, waist circumference, high-density lipoprotein cholesterol, triglyceride, glycosylated hemoglobin and fasting blood glucose; categorical variables: gender, BMI and self-reported hypertension) and predictor combination II (categorical variables: age, gender, BMI, hypertension, waist circumference, HDL cholesterol, triglycerides, glycosylated hemoglobin and fasting blood glucose). Machine learning models were trained and internally validated with Python 3.7.6 and R 4.2.1. Ten-fold cross-validation was used to split the data into training and test sets. Class imbalance was handled in the training set, which was then used to train the Logistic regression, random forest, LightGBM and Stacking models. Random forest and LightGBM were optimized by random search combined with five-fold cross-validation, and the optimized models were used in the Stacking
method. The test sets were used to validate the models. The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy and Brier score were calculated at the default cut-off value (0.5). The models' feature importance, SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) were used to explain the decision process of the random forest, LightGBM and Stacking models. For the Stacking model, only the LIME method was used, for local interpretability analysis. Finally, a web page was used to display and apply the risk assessment model.

Results: A total of 8 063 subjects were included in the study, of whom 1 088 (about 13.5%) developed type 2 diabetes in 2015. Logistic regression based on predictor combination I showed that age (HR=1.018, 95% CI: 1.009~1.026, P<0.001), BMI ≥28.0 kg/m² (HR=1.702, 95% CI: 1.342~2.160, P<0.001), waist circumference (HR=1.012, 95% CI: 1.005~1.020, P=0.001), triglyceride (HR=1.001, 95% CI: 1.000~1.002, P=0.004), glycosylated hemoglobin (HR=2.221, 95% CI: 1.876~2.630, P<0.001) and fasting blood glucose (HR=1.013, 95% CI: 1.007~1.019, P<0.001) were risk factors for type 2 diabetes in middle-aged and older adults, whereas self-reported hypertension (HR=0.780, 95% CI: 0.667~0.912, P=0.002) was a protective factor. Logistic regression based on predictor combination II showed that age 50~64 years (HR=1.385, 95% CI: 1.150~1.667, P=0.001) and ≥65 years (HR=1.47, 95% CI: 1.418~2.152, P<0.001), hypertension (HR=1.330, 95% CI: 1.144~1.546, P<0.001), BMI ≥28.0 kg/m² (HR=1.954, 95% CI: 1.587~2.407, P<0.001), central obesity (HR=1.464, 95% CI: 1.232~1.739, P<0.001), glycosylated hemoglobin of 5.7~6.4% (HR=1.366, 95% CI: 1.195~1.562, P<0.001) and fasting blood glucose of 100~125 mg/dL (HR=2.257, 95% CI: 1.794~2.841, P<0.001) were risk factors for type 2 diabetes in middle-aged and older adults. Machine learning models were constructed separately on predictor combination I and predictor combination II. Among them, the Stacking
model based on predictor combination I had, on the test sets, an average AUC of 0.662±0.046, an average sensitivity of 0.593±0.063, an average specificity of 0.642±0.019, an average accuracy of 0.635±0.021 and an average Brier score of 0.232±0.005. Its AUC was better than that of Logistic regression, random forest and LightGBM, and its sensitivity was better than that of random forest and LightGBM. The Stacking model built on predictor combination II still had a better average AUC (0.634±0.018) and average sensitivity (0.592±0.058), but all of its metrics were lower than those of the Stacking model based on predictor combination I. Therefore, the Stacking model based on predictor combination I was chosen as the optimal model for subsequent analysis. Global interpretability analysis of the random forest, LightGBM and Stacking models was performed using feature importance and the SHAP method. Feature importance showed that waist circumference, glycosylated hemoglobin, fasting blood glucose and triglyceride had high average relative importance in both the random forest and LightGBM models and ranked among the top 5 variables in both. Global interpretability analysis based on the SHAP method showed that waist circumference, glycosylated hemoglobin, fasting blood glucose, mean systolic blood pressure, mean diastolic blood pressure, BMI, age and triglyceride were positively correlated with a positive prediction of type 2 diabetes in random forest and LightGBM. The feature dependence plots of random forest and LightGBM were similar: waist circumference, glycosylated hemoglobin, fasting blood glucose, mean systolic blood pressure and BMI showed a non-linear association with the SHAP value. One true positive case and one true negative case were analyzed for local interpretability. Summarizing the local interpretability results of the SHAP and LIME methods for the random forest, LightGBM and Stacking models showed that, for the true positive instance, higher
fasting glucose, glycosylated hemoglobin and triglycerides contributed more to the positive prediction. For the true negative instance, lower waist circumference, glycosylated hemoglobin, fasting glucose and BMI were unlikely to increase the risk of type 2 diabetes in that sample. These results were consistent with those of Logistic regression and the global interpretability analysis.

Conclusion: Based on the results of Logistic regression and the interpretability analysis, waist circumference, BMI, fasting blood glucose and glycosylated hemoglobin are important risk factors for, and predictors of, future type 2 diabetes in middle-aged and elderly people. Despite the poor generalization ability of all the models constructed in this study, the results still indicate that the Stacking model is more advantageous on complex data. This explainable learning model, which integrates different machine learning algorithms, can combine the advantages of the individual models, compensate for the limitations of each, and obtain better results. The model is presented on a web page in which part of the interpretability analysis results are embedded. This approach lets users obtain prediction results directly without handling the calculation process, and also reminds individuals of their own risk factors, facilitating early preventive intervention to reduce the risk of disease.
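
The evaluation metrics named in the Methods (AUC, sensitivity, specificity, accuracy and Brier score at the 0.5 cut-off) can be illustrated with a minimal standard-library sketch. It is not the code used in the study: the function names `auc_score` and `evaluate`, the `>=` convention at the cut-off, and the eight toy labels and probabilities are all assumptions made for illustration only.

```python
from typing import Dict, Sequence


def auc_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the chance that a random positive case is scored above a
    random negative case; tied scores count as half a win."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def evaluate(
    y_true: Sequence[int], y_prob: Sequence[float], cutoff: float = 0.5
) -> Dict[str, float]:
    """Compute the five metrics from predicted probabilities.

    Assumption: a probability >= cutoff is classified as positive.
    """
    auc = auc_score(y_true, y_prob)  # also guarantees both classes are present
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    tp = sum(1 for y, d in zip(y_true, y_pred) if y == 1 and d == 1)
    fn = sum(1 for y, d in zip(y_true, y_pred) if y == 1 and d == 0)
    fp = sum(1 for y, d in zip(y_true, y_pred) if y == 0 and d == 1)
    tn = sum(1 for y, d in zip(y_true, y_pred) if y == 0 and d == 0)
    # Brier score: mean squared gap between probability and observed outcome.
    brier = sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
    return {
        "auc": auc,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "brier": brier,
    }


# Toy data (made up for illustration, not taken from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.5, 0.1]
metrics = evaluate(y_true, y_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a cross-validated setting such as the ten-fold scheme described above, a function like `evaluate` would be applied to each test fold and the results summarized as mean ± standard deviation. A lower Brier score indicates better-calibrated probabilities, whereas higher values are better for the other four metrics.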