Background: Stroke is an acute cerebrovascular disease caused by vascular abnormalities and has become an important cause of death and disability worldwide. Between 1990 and 2019, the global number of stroke patients rose by 70%, and the incidence and fatality rate of stroke increased by 85% and 43%, respectively. Stroke has also become the leading cause of death in China. Meanwhile, China is entering a phase of rapid population aging, and stroke incidence is rising, with clear gender disparities and a trend toward younger onset. Without effective preventive measures, the disease burden caused by stroke will continue to grow. It is therefore crucial to predict stroke risk and identify high-risk groups for early prevention. In recent years, with the rapid development of health-related big data and computer technology, machine learning has been widely used in medicine and health care and has shown excellent prediction performance. However, class imbalance is an unavoidable problem in prediction tasks and poses a new challenge to accurate prediction. To date, a growing number of strategies have been developed to address imbalanced data. Accordingly, this study focuses on how to use machine learning algorithms to estimate stroke risk by gender among the middle-aged and elderly Chinese population from imbalanced data.

Methods: Data were obtained from the two latest waves (2015 and 2018) of the China Health and Retirement Longitudinal Study. The 2015 wave served as the baseline: all participants aged 45 years and older and free of stroke at baseline were included, and their epidemiological and blood biomarker information was collected in that wave. The 2018 wave provided the follow-up outcome (stroke vs. non-stroke) after three years of follow-up. First, the study selected important predictors (epidemiological and biomarker information) using a two-step feature
selection method in the training set. The Gini index of a random forest was first used to rank all variables by importance; the variables were then added to the random forest one at a time in descending order of importance, with model performance assessed by the area under the ROC curve (AUROC), and a variable was retained in the feature-selection set only if it improved the AUROC at its iteration. Based on the feature-selection data set, several machine learning models (logistic regression, decision tree, support vector machine, random forest, extreme gradient boosting, and artificial neural network) were combined with data balancing techniques to predict stroke risk by gender among the middle-aged and elderly Chinese population: oversampling, undersampling, and synthetic sampling at the data level, and threshold tuning, cost-sensitive learning, ensemble techniques, and anomaly detection at the algorithm level. In the training set, ten-fold cross-validation was used to tune the hyperparameters of the machine learning algorithms. For stroke risk prediction, we first evaluated the predictive value of the baseline epidemiological variables alone and then assessed the performance of epidemiological and blood biomarker variables combined. Model performance was evaluated comprehensively in terms of discrimination, calibration, and clinical usefulness, and the optimal model was selected based on these metrics. For discrimination, the commonly used metrics of accuracy, sensitivity, and positive predictive value were reported together with the composite metrics G-mean and AUROC. For calibration, the Brier score was used. Decision curve analysis was applied to assess the clinical usefulness of the prediction models. The predicted probabilities of the optimal model were also analyzed to explore the key predictors of high-risk stroke
populations. Furthermore, Permutation Importance and LIME were used to analyze the global and local interpretability of the prediction models, with the aim of uncovering the models’ decision-making mechanisms and ultimately supporting the use of risk prediction models in clinical practice.

Results: A total of 11,140 participants were included in the analysis, 46.10% of whom were men. The mean age of the study population was 60.57 years, and 50.9% were aged 60 years or older. The mean ages of men and women were 61.11 and 60.12 years, respectively, and 53.1% of men and 49.1% of women were aged 60 years or older. Over the 3-year follow-up, the incidence of stroke was 5.71% in the whole population and was slightly higher in men than in women (5.86% vs. 5.58%). When only epidemiological data were included, prediction performance on the imbalanced dataset was generally poor for both men and women; the decision tree performed relatively well, with a G-mean of 0.35 for men and 0.32 for women. After data balancing techniques were applied, model performance improved markedly, especially with the SMOTE oversampling algorithm: in the feature-selection dataset, the G-mean reached 0.62 for men (logistic regression) and 0.62 for women (logistic regression). The synthetic sampling technique (SMOTETomek) also performed well, while TomekLinks performed worst. With threshold tuning, cost-sensitive learning, and ensemble techniques, overall model performance improved greatly compared with the imbalanced data, especially for the ensemble techniques (BalancedBagging and EasyEnsemble). For example, sensitivity and G-mean reached 0.65 and 0.60 with BalancedBagging and logistic regression for men, and 0.74 and 0.64 with EasyEnsemble and logistic regression for women. The anomaly detection algorithms also greatly improved prediction performance; for instance, the G-mean of
LocalOutlierFactor reached 0.92 for men and 0.91 for women. When both epidemiological and blood biomarker variables were considered, model performance improved slightly compared with models using only epidemiological variables, but remained generally stable. Owing to class imbalance, the accuracy of the vast majority of models was generally high, while their sensitivity and positive predictive value were relatively low. Overall, logistic regression performed considerably better than the other prediction models, and its performance was more stable across tasks. The global interpretability analysis revealed large differences in predictors between genders: age and grip strength were the two most important predictors for men, and blood biomarker information was less important in predicting risk for men. For women, pulse pressure, platelet count, working status, and hyperlipidemia were more important, and blood biomarkers contributed more to stroke risk prediction than they did for men. The local interpretability analysis revealed large differences among predictors in individual-level risk prediction.

Conclusions: For classification tasks, class imbalance greatly limits the prediction performance of machine learning models. At the data level, SMOTE and SMOTETomek are effective class balancing techniques that can substantially improve the performance of prediction models. Feature selection not only reduces the dimensionality of the predictors but also maintains high prediction performance, making it an important means of addressing class imbalance. Threshold tuning, cost-sensitive learning, and ensemble techniques can effectively address class imbalance, especially methods such as BalancedBagging and EasyEnsemble that combine data balancing with ensembling. Anomaly
detection algorithms, especially LocalOutlierFactor, can also help distinguish stroke cases from the normal population. Logistic regression is a classic analytical model, and it also showed good and stable performance in stroke prediction. Overall prediction accuracy was slightly higher for men than for women, and predictors differed considerably between genders, indicating that gender-specific prevention measures are needed. Interpretability analysis can provide a deeper understanding of how black-box machine learning models work and increase decision makers’ confidence in using them, which can ultimately help bring machine learning into clinical practice.
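
The discrimination and calibration metrics named in the Methods can be illustrated with a minimal Python sketch. It is not the study's code: it assumes the common definition of G-mean as the geometric mean of sensitivity and specificity (the abstract does not spell out the formula), the function names `imbalance_metrics` and `brier_score` are hypothetical, and the toy labels and probabilities are invented for illustration only.

```python
import math


def imbalance_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity, specificity, PPV and G-mean at a given threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    # Guard against empty denominators (e.g. no positive predictions at all).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        # Assumed definition: geometric mean of sensitivity and specificity.
        "g_mean": math.sqrt(sensitivity * specificity),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Invented toy data: 2 stroke cases among 20 people (10% positives).
y_true = [1, 1] + [0] * 18
y_prob = [0.40, 0.22] + [0.30, 0.25] + [0.05] * 16

default_metrics = imbalance_metrics(y_true, y_prob)               # threshold 0.5
tuned_metrics = imbalance_metrics(y_true, y_prob, threshold=0.2)  # lowered threshold

print("threshold 0.5:", default_metrics)
print("threshold 0.2:", tuned_metrics)
print("Brier score:", brier_score(y_true, y_prob))
```

On this toy data, lowering the threshold from 0.5 to 0.2 leaves accuracy unchanged while sensitivity and G-mean rise from zero, which mirrors the pattern reported in the Results for imbalanced data: high accuracy alongside low sensitivity.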