Diabetes is a chronic disease characterized mainly by hyperglycemia. It causes a series of complications and endangers people's health, and so far there is no complete cure. Early screening of patients helps reduce the incidence of diabetes. China has a large population and basic resources are relatively scarce, so manual screening rarely achieves good results. Data mining algorithms can instead be applied to process diabetes-related data efficiently, which supports early detection and treatment of diabetes and is of great significance for improving public health in China. Previous studies have applied a growing number of data mining algorithms to chronic disease prediction, such as KNN, random forest, LightGBM, and neural networks. Building on that work, this paper reviews research progress in diabetes data processing and in the construction of prediction algorithms, and builds models to predict the risk of diabetes. The main research work is as follows:

1. Targeted data preprocessing is carried out according to the characteristics and requirements of the subsequent algorithms. Because the data sets used in this paper all come from a public platform, they are relatively reliable. First, the original diabetes data set was categorized and examined to identify the associations between data attributes. Then the data were preprocessed, which mainly involved handling missing values and outliers and standardizing the values. Finally, the data set used for the diabetes early-warning model was established.

2. A filter feature selection method based on random forest (RF) is studied and optimized. First, all features are ranked by their random forest feature importance. Then the top-ranked features are selected to train the classification model. Finally, among the candidate feature subsets, the one with the best classification accuracy is selected as the input features of the diabetes classification prediction model.

3. A regression prediction model of diabetes risk based on LightGBM is studied and optimized. Three LightGBM models are proposed and studied for the regression prediction of diabetes: LightGBM optimized by Bayesian optimization, LightGBM optimized by a genetic algorithm, and LightGBM optimized by random search. After the models were built, root mean square error (RMSE) and other evaluation metrics were used to evaluate and compare their prediction results. A series of experiments shows that the LightGBM model optimized by Bayesian optimization performs better than the LightGBM models optimized by the other algorithms, and a comparative analysis shows that it also performs better than the other regression models.

4. A diabetes risk classification prediction model with an improved Stacking method is proposed. After the diabetes data were analyzed and processed, individual data mining algorithms were first used to build diabetes prediction models, and a Stacking approach was then used to build a fusion model of these four methods. To address the shortcomings of traditional Stacking, two stages are improved: base learner generation and base learner fusion. To increase the diversity among base learners in ensemble learning, the bootstrap method is used to generate several training subsets. For model fusion, an optimized Stacking classification prediction model is proposed that combines the base learners according to their weight coefficients.
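
The two improvements to Stacking described above, bootstrap training subsets and weighted fusion of base learners, can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the function names are hypothetical, the weights are taken to be proportional to each base learner's validation accuracy, and all numbers are invented for illustration. It also replaces the trained Stacking model with a plain weighted soft vote, so it shows the idea rather than the proposed method.

```python
import random


def bootstrap_subsets(n_samples, n_subsets, seed=0):
    """Draw n_subsets index lists, each sampled with replacement from range(n_samples)."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_samples) for _ in range(n_samples)]
        for _ in range(n_subsets)
    ]


def accuracy_weights(accuracies):
    """Normalize validation accuracies into weights that sum to 1 (assumed scheme)."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def weighted_fusion(probabilities, weights):
    """Combine per-learner positive-class probabilities into one score per sample."""
    n = len(probabilities[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probabilities))
        for i in range(n)
    ]


# Hypothetical setup: three base learners, each trained on its own bootstrap subset.
subsets = bootstrap_subsets(n_samples=10, n_subsets=3)

# Invented validation accuracies and predicted probabilities for four samples.
weights = accuracy_weights([0.80, 0.70, 0.50])
probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.3, 0.4, 0.5],
    [0.6, 0.6, 0.5, 0.1],
]
fused = weighted_fusion(probs, weights)
labels = [int(score >= 0.5) for score in fused]
```

With these invented inputs the weights come out as 0.4, 0.35, and 0.25, and the fused labels are [1, 0, 1, 0]. Each bootstrap subset holds as many indices as the original data set, with repeats, which is what makes the base learners differ from one another.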