| Objective: In this study, several models were constructed and compared in order to select the optimal model for predicting, with high accuracy, the efficacy of oral folic acid intervention for hyperhomocysteinemia (HHcy). Four newer machine learning algorithms were used to construct HHcy prediction models and were compared with the traditional logistic regression (LR) algorithm. We also examined the effect of three feature sets: a traditional-factor feature set, a genetic-factor feature set, and a pooled feature set combining the two. We then compared different methods of dividing the data into training and validation sets. Lastly, we optimized the parameters of the predictive model. The final aim of the study was to provide a scientific basis for more effective and accurate treatment of HHcy. Methods: Patients with HHcy who voluntarily underwent a 90-day oral folic acid intervention (5 mg/d) at the Fifth Affiliated Hospital of Zhengzhou University from July 2014 to December 2014 were initially selected, and 638 HHcy patients who had complete information and met the inclusion criteria were finally enrolled. Based on plasma homocysteine (Hcy) level after the intervention, patients were assigned to the treatment success group (Hcy < 15 μmol/L) or the treatment failure group (Hcy ≥ 15 μmol/L). 1. Based on LR analysis, meaningful independent variables were selected by univariate analysis followed by multivariable stepwise regression, which yielded the different feature sets. 2. Traditional LR models were constructed on the feature set containing only genetic factors, the feature set containing only traditional factors, and the fused feature set containing both. The evaluation indexes were calculated for each model, and the optimal feature set for predicting the efficacy of HHcy treatment was selected according to these indexes. 3. Based on the optimal feature set, we constructed models using the LR algorithm and four newer machine learning
algorithms (eXtreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), Random Forest (RF), and Artificial Neural Network (ANN)). The optimal algorithm for predicting the efficacy of HHcy treatment was selected according to the evaluation indexes. 4. Based on the optimal feature set and algorithm, the dataset was randomly partitioned into training and validation sets using the hold-out method and the k-fold cross-validation method. The optimal splitting method for predicting the efficacy of HHcy treatment was then selected according to the evaluation indexes. 5. The parameters of the predictive model were optimized to determine the optimal model for predicting the efficacy of HHcy treatment in this study. Results: 1. Based on the LR analysis, the traditional factors included in model construction, which made up the traditional feature set, were: BMI, family history, history of hypertension, history of stroke, history of coronary heart disease, LDL-C, HDL-C, and baseline Hcy. The meaningful genetic factors, which made up the genetic feature set, were: MTHFR rs1801133, MTHFR rs1801131, MTHFD rs2236225, MTRR rs1801394, CBS rs706209, and BHMT rs3733890. 2. The model based on the fused feature set, which combined traditional and genetic factors, performed better than the models based on the other two feature sets (AUC: 0.958, accuracy: 92.1%, sensitivity: 90.5%, specificity: 91.3%, Youden index: 0.818). 3. The models constructed with machine learning algorithms all performed better than the traditional LR model, and XGBoost performed best among them (AUC: 0.941, accuracy: 94.6%, sensitivity: 94.7%, specificity: 94.5%, Youden index: 0.892). 4. The model performed best when the dataset was split by the hold-out method at a 7:3 ratio between training and validation sets (validation set: AUC: 0.964, accuracy: 95.7%, sensitivity: 94.9%, specificity: 93.0%, Youden index: 0.879). 5. Based on the optimized fused feature set
and the XGBoost algorithm (maximum depth: 5, learning rate: 0.01, λ: 0.3, all other parameters at their default values), with the dataset split by the hold-out method at a 7:3 ratio between training and validation sets, we constructed the best model for predicting the efficacy of folic acid treatment for HHcy. Conclusions: 1. Adding genetic factors to the traditional factors significantly improved prediction of the efficacy of oral folic acid treatment for HHcy, making the prediction more accurate and comprehensive. 2. The models constructed with machine learning algorithms all performed better than the traditional LR model. Machine learning algorithms are suitable for clinical prediction of the efficacy of oral folic acid treatment for HHcy, and the model constructed with the XGBoost algorithm performed best. 3. The best model was obtained when the dataset was split randomly by the hold-out method at a 7:3 ratio between training and validation sets. Comparing several evaluation indexes identified the optimal splitting method for modeling, which provides a scientific basis for constructing a more accurate predictive model. |
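
The evaluation indexes reported above are related by simple arithmetic: the Youden index is sensitivity plus specificity minus 1. The following Python sketch shows how such indexes could be derived from confusion-matrix counts and checks the reported Youden values against the reported sensitivity and specificity. The function names `evaluation_indexes` and `youden_index` and the example counts are hypothetical and are not taken from the study, and the abstract does not state which outcome group was treated as the positive class.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def evaluation_indexes(tp, fn, tn, fp):
    """Compute the evaluation indexes from confusion-matrix counts.

    tp/fn: positive cases predicted correctly/incorrectly.
    tn/fp: negative cases predicted correctly/incorrectly.
    Which group is "positive" is an assumption left to the caller.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": youden_index(sensitivity, specificity),
    }


# (sensitivity, specificity, reported Youden index) as given in the Results.
reported = {
    "fused feature set, LR": (0.905, 0.913, 0.818),
    "XGBoost": (0.947, 0.945, 0.892),
    "hold-out 7:3, validation set": (0.949, 0.930, 0.879),
}

for name, (sens, spec, reported_j) in reported.items():
    computed_j = round(youden_index(sens, spec), 3)
    print(f"{name}: computed J = {computed_j}, reported J = {reported_j}")

# Illustrative counts only (not study data).
print(evaluation_indexes(tp=90, fn=10, tn=80, fp=20))
```

For all three reported results, the recomputed Youden index agrees with the published value to three decimal places.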