| With rising living standards and an aging population, diabetes and its complications have gradually become one of the main challenges to people's health, seriously affecting their quality of life. Diabetes cannot be cured; only early detection and treatment can reduce its complications and mortality. Building a predictive diagnosis model for diabetes through machine learning, and using it for risk assessment, comprehensive screening, and intervention on potential influencing factors, is therefore of great significance for saving medical resources and reducing the burden on families. In past research, scholars mainly used statistical methods to build prediction models, such as the logistic regression model and the Cox proportional hazards model. Now, with the explosive growth of medical data, more and more machine learning algorithms can be used for disease prediction, such as decision trees, support vector machines, XGBoost, and neural networks. Because diabetes is a chronic disease with many patients and no cure, the application of advanced technology to its prediction deserves special attention. Building on previous studies, this paper uses Python to process diabetes data and build an algorithmic framework for diabetes prediction. The main research work is as follows: 1) Exploration and processing of diabetes data. Prediction models for diabetes diagnosis from domestic and international studies are analyzed to understand the data types and model inputs. The original data are obtained from UCI and the Tianchi platform and, combined with the corresponding medical knowledge, are subjected to exploratory analysis to find the correlations between features and to explore the structure and patterns of the data. The data are then preprocessed, including handling of missing values and outliers, dirty-data cleaning, standardization, sample balancing, and data reduction, finally yielding efficient and usable modeling data. 2) Study of disease prediction models and selection of the appropriate
model for each problem. For the regression problem, linear regression, decision tree, support vector regression, neural network, and XGBoost models are built and compared, and XGBoost, which performs best on the regression task, is selected. For the classification problem, logistic regression, decision tree, support vector machine, random forest, and XGBoost are selected as the base classifiers and fused through stacking; this ensemble learning approach improves prediction accuracy. 3) Because machine learning models have many parameters that are complex to tune, a GA-XGBoost model is proposed to optimize the algorithm. On top of XGBoost model training, a genetic algorithm is introduced that encodes the parameters and searches multiple peaks in parallel, ultimately improving the accuracy of the model. The experimental results show that, in the regression task, XGBoost has clear advantages on the three evaluation criteria of MAE, MSE, and MAPE, and the GA-XGBoost model further improves prediction accuracy. In the classification task, the stacking-based fusion model also outperforms the individual base classifiers. |
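
The abstract does not specify how the genetic algorithm encodes or evolves the XGBoost parameters, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes a real-valued encoding in [0, 1], truncation selection, uniform crossover, Gaussian mutation, and elitism. The names `BOUNDS`, `decode`, `placeholder_fitness`, and `genetic_search`, as well as the parameter ranges, are hypothetical. The fitness function is a stand-in; in a real setting it would train an XGBoost model with the decoded parameters and return a validation score.

```python
import random

# Hypothetical search space: (low, high) bounds for three XGBoost-style
# hyperparameters. Names and ranges are illustrative assumptions only.
BOUNDS = {
    "learning_rate": (0.01, 0.3),
    "max_depth": (2, 10),
    "subsample": (0.5, 1.0),
}
INT_PARAMS = {"max_depth"}


def decode(genes):
    """Map a chromosome of values in [0, 1] to concrete hyperparameters."""
    params = {}
    for gene, (name, (low, high)) in zip(genes, BOUNDS.items()):
        value = low + gene * (high - low)
        params[name] = int(round(value)) if name in INT_PARAMS else value
    return params


def placeholder_fitness(params):
    """Stand-in for a validation score (higher is better).

    Assumption: a real run would train a model with `params` and return,
    for example, the negative validation MAE. This quadratic exists only
    to make the sketch runnable without any data or third-party library.
    """
    return -(
        (params["learning_rate"] - 0.1) ** 2
        + ((params["max_depth"] - 6) / 10) ** 2
        + (params["subsample"] - 0.8) ** 2
    )


def genetic_search(fitness, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Evolve a population of encoded parameter sets and return the best."""
    rng = random.Random(seed)
    n_genes = len(BOUNDS)
    population = [[rng.random() for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: fitness(decode(g)),
                        reverse=True)
        history.append(fitness(decode(ranked[0])))
        next_gen = [ranked[0][:]]          # elitism: keep the best unchanged
        parents = ranked[: pop_size // 2]  # truncation selection
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Gaussian mutation, clipped so genes stay inside [0, 1].
            child = [
                min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                if rng.random() < mutation_rate else g
                for g in child
            ]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=lambda g: fitness(decode(g)))
    return decode(best), fitness(decode(best)), history


if __name__ == "__main__":
    best_params, best_score, _ = genetic_search(placeholder_fitness)
    print(best_params, best_score)
```

Because the whole population is evaluated in every generation, several promising regions of the parameter space are explored at once, which is the sense in which a genetic algorithm can search multiple peaks in parallel. Elitism guarantees that the best score never decreases from one generation to the next.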