| With the continuous development of artificial intelligence and of science and technology, Internet technology has begun to play a major role in disease prediction. Cerebrovascular disease, commonly known as 'stroke', is a disease in which blood vessels in the brain rupture and bleed, or in which blood forms a clot on the inner surface of a vessel during cardiovascular spasm or repair, resulting in blood clots in the brain. The establishment of an efficient disease prediction model therefore plays an important role in disease prevention. Existing work mainly uses traditional machine learning methods for modeling and analysis. However, because traditional machine learning algorithms rely on a single classifier, they often do not perform very well. In this paper, more stable ensemble algorithms are used for model construction, and the ensemble models are used as base classifiers with logistic regression as the meta-classifier for Stacking fusion. The results of Stacking fusion were compared with those of Voting fusion. Finally, the Voting fusion model replaced logistic regression as the meta-classifier of the Stacking model, so that the two fusion models were combined to achieve a more accurate prediction of cerebrovascular disease. This paper first introduces the relevant theory of the common algorithms and uses the open dataset healthcare_dataset_stroke on Kaggle for data cleaning and feature selection. Because the data set is imbalanced, SMOTE oversampling, Borderline-SMOTE sampling and ADASYN sampling are applied to it, respectively. The sampled data are modeled with logistic regression and the ensemble algorithms Random Forest, XGBoost and LightGBM to obtain the AUC of each model. The simulation was repeated 100 times and the mean AUC was taken. Finally, the ADASYN algorithm, which had the highest overall mean AUC, was selected; after imbalance treatment, the positive and negative sample sizes were 4953 and 4860, respectively. The second step is model training, which divides the data 
into a training set and a test set at a ratio of 8:2, applies logistic regression and the ensemble algorithms, and selects the optimal parameters for modeling. Next, the ensemble models are used as base classifiers and logistic regression as the meta-classifier for Stacking fusion. At the same time, Voting fusion is performed, and the performance of the two fusion models is compared and analyzed. To further improve stability and accuracy, the Stacking model and the Voting model are combined for prediction. The mean of each model's evaluation metrics is obtained by repeating the training 100 times. Commonly used classification metrics include accuracy, precision, recall, F1-score and AUC. Given the nature of this data set, the risk of a diseased sample being judged as disease-free is much higher than that of a non-diseased sample being judged as sick, so F2-score is used instead of F1-score to give higher weight to recall. The strengths and weaknesses of each model are assessed comprehensively according to these metrics. The results show that the Stacking fusion model and the Voting fusion model, which use the ensemble algorithms Random Forest, XGBoost and LightGBM as base classifiers, outperform logistic regression, and that the Voting fusion model outperforms the Stacking fusion model on the evaluation metrics. After the two fusion models were combined, the F2-score of the model was 0.98599 and the AUC reached 0.99101, which was the best prediction performance. Therefore, it can be considered that the combined model achieves more accurate prediction for patients with cerebrovascular disease and enables disease warning and early intervention for high-risk groups. |
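
The two fusion strategies described in the abstract can be illustrated with a minimal scikit-learn sketch. It makes several assumptions that are not taken from the paper: the data are synthetic (`make_classification`), `GradientBoostingClassifier` stands in for XGBoost and LightGBM so that the example depends on scikit-learn only, soft voting is assumed, and all hyperparameters are arbitrary. It shows only an 8:2 split, Stacking with a logistic-regression meta-classifier, and Voting fusion; it is not the authors' pipeline and does not reproduce the reported scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): not the Kaggle stroke dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# 8:2 train/test split, as stated in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def base_learners():
    """Return fresh ensemble base classifiers (stand-ins, see lead-in)."""
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ]


# Stacking fusion: ensemble base classifiers, logistic regression on top.
stacking = StackingClassifier(
    estimators=base_learners(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Voting fusion: average the predicted probabilities (soft voting assumed).
voting = VotingClassifier(estimators=base_learners(), voting="soft")

scores = {}
for name, model in [("stacking", stacking), ("voting", voting)]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        "f2": fbeta_score(y_test, model.predict(X_test), beta=2),
        "auc": roc_auc_score(y_test, proba),
    }

for name, metrics in scores.items():
    print(name, metrics)
```

Repeating the split and fit with different random seeds and averaging the entries of `scores` would mirror the repeated-training evaluation described above.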
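
To make the choice of F2-score over F1-score concrete, the sketch below computes the general F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), from confusion-matrix counts. The counts are hypothetical and chosen only to show that a model which misses many positive cases is penalized more by F2 than by F1; they are not results from the paper.

```python
def fbeta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F-beta score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0


# Hypothetical counts: 40 diseased samples are wrongly judged disease-free.
tp, fp, fn = 60, 10, 40

f1 = fbeta(tp, fp, fn, beta=1)  # precision and recall weighted equally
f2 = fbeta(tp, fp, fn, beta=2)  # recall weighted more heavily than precision

print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

Because recall (0.60) is lower than precision (about 0.86) in this example, F2 comes out below F1, which is the behavior wanted when missed diagnoses are the costlier error.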