| Objectives: Support vector machine (SVM), categorical boosting (CatBoost), back-propagation (BP) neural network, and deep learning algorithms were used to build prediction models from the 2011-2018 National Health and Nutrition Examination Survey (NHANES) data. The accuracy of the four methods in predicting depression was compared across sample sets, and the value of machine learning algorithms for depression prediction and auxiliary diagnosis was discussed. Methods: The general characteristics of the participants and of those with depression were described. Variables with more than 10% missing values in the 2011-2018 NHANES data were excluded; missing values of the retained variables were imputed with the overall mean or the same-class mean, and the data were resampled to address class imbalance. Stepwise regression was used to screen the feature variables for depression, and the filtered data were split into a training set and a test set at a ratio of 7:3. On the training set, four machine learning models (SVM, CatBoost, BP neural network, and deep learning) were built using the e1071 package, the catboost package, the nnet package, and the H2O platform in R. Sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC) were used to evaluate the predictive performance of each model on the test set. This process was repeated on the different sample sets to select the optimal prediction model for depression, and the optimal model was used to screen the influencing factors of depression. Results: 1. General information: Among the 19,406 adults aged 20 years and older, there were 9,515 men and 9,891 women, accounting for 49.0% and 51.0%, respectively; 1,747 participants had depression (9.0%) and 17,659 did not (91.0%). The χ² test showed that gender, age, race, education level, marital status, income, and BMI were significantly associated with depression (P<0.05). 2. Selection of feature set variables: There were 247 
features in this study. After variable screening by stepwise regression, the final modeling variables for the demographic feature set were 9 variables, including number of family members, race, and age. The final modeling variables of the laboratory feature set were 14 variables, including phosphorus (mmol/L), creatine phosphokinase (IU/L), and high-density lipoprotein cholesterol (mmol/L). The final modeling variables of the dietary feature set were 12 variables, including alcohol (g), dietary energy intake (kcal), and octadecatrienoic acid (g). The final modeling variables of the questionnaire feature set were 24 variables, including hypertension, smoking, and stroke. The final modeling variables of the physical activity feature set were 7 variables, including sedentary time, strenuous activity, and the need for special equipment when walking. The total feature set consisted of 66 variables. 3. Machine learning prediction of depression: The SVM, CatBoost, BP neural network, and deep learning models all performed best on the total feature set and worst on the dietary feature set. In the demographic feature set, the AUC values of the four models were 0.697, 0.701, 0.691, and 0.712, respectively. In the laboratory feature set, the AUC values were 0.635, 0.620, 0.641, and 0.655, respectively. In the dietary feature set, the AUC values were 0.602, 0.587, 0.594, and 0.619, respectively. In the questionnaire feature set, the AUC values were 0.785, 0.831, 0.824, and 0.838, respectively. In the physical activity feature set, the AUC values were 0.736, 0.761, 0.760, and 0.789, respectively. In the total feature set, the AUC values of the four models were all above 0.8: 0.849, 0.854, 0.853, and 0.863, respectively. The deep learning model was the best predictor of depression. 4. Screening of influencing factors of depression: The deep learning and CatBoost models were used to screen 
the important factors of depression. The top 5 important features of the deep learning model were sleep disorder, general health status, work restriction, activity restriction, and alcohol intake, with importance scores of 1.000, 0.976, 0.858, 0.776, and 0.745, respectively. The top 5 important features of the CatBoost model were general health status, creatinine, sleep disorder, memory difficulty, and work limitation, with importance scores of 19.361, 19.349, 13.489, 9.930, and 5.434, respectively. Conclusion: 1. The prevalence of depression among American adults is 9.0%, with women and adults with lower income and education being more likely to suffer from depression. 2. SVM, CatBoost, BP neural network, and deep learning models are all feasible for depression prediction. The deep learning model is the best model for predicting depression, followed by the CatBoost model. 3. The top 5 important features screened by the deep learning model were sleep disturbance, general health status, work restriction, activity restriction, and alcohol consumption. The leading influencing factor for depression in men was sleep disturbance, and in women it was alkaline phosphatase. 4. The top 5 important features screened by the CatBoost model were general health status, creatinine, sleep disorder, memory difficulty, and work limitation. The leading influencing factor for depression in men was sleep disturbance, and in women it was general health status. |
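
The four evaluation measures named in the Methods (sensitivity, specificity, accuracy, and AUC) can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the study's R code. The function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions, not values from the study. AUC is computed here as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = depression."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity_specificity_accuracy(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/N."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from NHANES): true labels and model scores.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.35, 0.7, 0.3, 0.05, 0.8]
# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec, acc = sensitivity_specificity_accuracy(y_true, y_pred)
print(sens, spec, acc, auc(y_true, scores))
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC is threshold-free, which is one reason it suits the comparison of models across feature sets.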