Diabetes is a common chronic metabolic disease caused by a complex combination of genetic and environmental factors. People with diabetes have elevated blood sugar levels and may develop serious complications such as cardiovascular disease, kidney disease and eye disease. Globally, millions of people are diagnosed with diabetes each year, and the number is rising. Chronic diseases such as diabetes not only reduce quality of life but also place a heavy burden on national medical and health expenditure. Earlier and more accurate identification of and intervention in diabetes can help control its incidence, improve people's well-being, and reduce the national medical burden. With the continuous improvement of information technology and data processing capabilities, applying machine learning to medicine has become a popular trend: machine learning can build predictive models from large amounts of diabetes patient data and thereby support better prevention and treatment. Based on the diabetes dataset released by the Alibaba Cloud Tianchi Competition, this thesis first reviews the current state of diabetes prediction and of research on boosting algorithms at home and abroad, and expounds the basic theory of feature selection, boosting models and model parameter optimization. The data are then preprocessed: outliers and records with missing data are removed, missing feature values are filled with the mean, discrete variables are encoded, and features measured on different scales are normalized. Because the data are high-dimensional, two feature selection methods, a combined filter method and an embedded method, are used, and a total of 18 important features are selected for modeling. Before modeling, the dataset is divided into a training set and a test set; decision tree, AdaBoost, XGBoost and CatBoost models are then established based on
the training set data, and the model hyperparameters are optimized through grid search, random search and Bayesian optimization to obtain boosting models with the optimal parameter combination and to output feature importance scores. To further improve prediction accuracy, this thesis uses the stacking method to fuse the three tuned boosting models, AdaBoost, XGBoost and CatBoost. Finally, each model is evaluated by its MSE, RMSE and MAE on the test set. The findings are as follows: (1) Every boosting model performs better than its weak learner, the decision tree regression model; CatBoost performs best, with MSE, RMSE and MAE of 0.7762, 0.8810 and 0.5161, respectively. (2) Model fusion effectively improves accuracy: the stacked boosting model achieves MSE, RMSE and MAE of 0.7545, 0.8686 and 0.5023, respectively. (3) Across the variable importance scores of the three boosting models, age, triglycerides, *aspartate aminotransferase, *γ-glutamyl transferase, uric acid and *alkaline phosphatase rank highly, and blood glucose concentration is strongly affected by these factors. Age is far more important than the other variables, indicating that the likelihood of diabetes increases with age.
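
The stacking and evaluation steps summarized above can be illustrated with a minimal sketch. It is not the thesis's actual pipeline and rests on several assumptions: the data are synthetic (18 features, standing in for the selected Tianchi features), scikit-learn's GradientBoostingRegressor is swapped in for XGBoost and CatBoost so the example depends on one library only, the meta-learner is assumed to be a linear regression (the abstract does not name one), and all hyperparameter values are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 18 selected features (assumption: not the Tianchi data).
X, y = make_regression(n_samples=400, n_features=18, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def evaluate(model):
    """Fit on the training set and return MSE, RMSE and MAE on the test set."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = float(mean_squared_error(y_test, pred))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "MAE": float(mean_absolute_error(y_test, pred)),
    }


# Three boosting base learners; the two gradient boosting models are
# stand-ins (assumption) for XGBoost and CatBoost.
base_learners = [
    ("adaboost", AdaBoostRegressor(n_estimators=50, random_state=0)),
    ("gbr_a", GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)),
    ("gbr_b", GradientBoostingRegressor(n_estimators=50, max_depth=2,
                                        learning_rate=0.2, random_state=1)),
]

# Stacking: out-of-fold predictions of the base learners become the inputs
# of the meta-learner (assumed here to be a linear regression).
stack = StackingRegressor(
    estimators=base_learners, final_estimator=LinearRegression(), cv=5
)

results = {
    "decision_tree": evaluate(DecisionTreeRegressor(max_depth=5, random_state=0)),
    "stacking": evaluate(stack),
}
for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Because RMSE is defined as the square root of MSE, the two always move together, which is consistent with the reported pairs (0.7762, 0.8810) and (0.7545, 0.8686); MAE is reported separately because it is less sensitive to large individual errors.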