| Objectives (1) To comprehensively and systematically assess the association between environmental factors and low birth weight (LBW) in the offspring, which could help provide evidence for the development of prevention and control measures for LBW. Here, the environmental factors include socio-demographic characteristics, previous pregnancy history, health status during the peri-pregnancy period, medication use during the peri-pregnancy period, behavioral habits during the peri-pregnancy period, exposure to hazardous environmental substances during the peri-pregnancy period, and occurrence of pregnancy-related complications, all of which were collected through a prospective cohort study. (2) To develop prediction models of LBW using machine learning algorithms and data collected through a prospective cohort study (especially factors collected before pregnancy or in early pregnancy), and to identify the optimal model by internal testing and external validation of predictive performance, which could provide a "pre-screening" tool for early identification of pregnant women at high risk for LBW. The machine learning algorithms include logistic regression, decision tree, naive Bayesian discrimination, k-nearest neighbor (kNN), support vector machine (SVM), random forest (RF), and XGBoost. Methods A prospective cohort study was conducted among pregnant women and their children in several maternal and child health care hospitals in Hunan Province. From August 2014 to December 2019, pregnant women who received their first antenatal care between 8 and 14 weeks of gestation were invited to join the cohort, and those who met the inclusion criteria were followed up until 3 months postpartum to collect detailed information on infant outcomes such as LBW. Exposure-related information was collected by epidemiological questionnaires combined with queries of the hospital-based electronic medical record system, while pregnancy outcome-related information was collected by querying the hospital-based 
electronic medical record system. After data collection was completed, the analysis of influencing factors for LBW and the construction of predictive models for LBW were performed. (1) Descriptive statistics were used to assess the characteristics of study participants, and the chi-square test, continuity-corrected chi-square test, or Fisher’s exact test was used to compare categorical variables. Variables that might be associated with LBW (P<0.05) were included in the multivariable Poisson regression model to estimate the relative risks (RRs) and their corresponding 95% confidence intervals (CIs), and to identify the independent influencing factors of LBW (P<0.05). (2) The dataset, consisting of outcome and predictor variables, was used for model construction. The outcome variable was whether the offspring had LBW. Maternal pre-pregnancy or early pregnancy exposure factors were included as potential predictor variables. First, the dataset was randomly divided into a training set (70%) and a test set (30%). Since the training set was imbalanced with respect to the outcome variable, with positive cases relatively rare, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to obtain equal representation of each class. Based on the resampled training set, a two-stage feature selection method was proposed to identify the optimal predictive features for LBW. Seven machine learning algorithms (i.e., logistic regression, naive Bayes discrimination, C4.5 decision tree, support vector machine, k-nearest neighbor, random forest, and XGBoost) were applied to construct the predictive models on the SMOTE-balanced training set, with the selected features as the predictor variables and LBW in the offspring as the outcome variable. Then, the test set was used to evaluate model performance by computing evaluation indices such as accuracy, sensitivity, specificity, positive predictive value (PPV), negative 
predictive value (NPV), and area under the ROC curve (AUC). Meanwhile, the models’ prediction performance was validated in an external validation set, for which the data were collected from another cohort study in early pregnancy (the process of cohort construction was the same as for the above cohort). Finally, the results of the internal testing and external validation were combined to determine the optimal prediction model for LBW. Results (1) In the study of influencing factors of LBW and in the construction and internal testing of prediction models for LBW, a total of 34,104 pregnant women were included, with an incidence of LBW in the offspring of 8.9% (95% CI: 8.6%-9.2%). In the external validation study of prediction models for LBW, a total of 9,249 pregnant women were included, with an incidence of LBW in the offspring of 7.7% (95% CI: 7.2%-8.3%). (2) Multivariable Poisson regression analysis indicated that maternal factors including age less than 25 years (RR=1.58, 95% CI: 1.33-1.88), 35-39.9 years (RR=1.40, 95% CI: 1.25-1.57), or ≥40 years (RR=1.27, 95% CI: 1.04-1.55), living in rural areas (RR=1.19, 95% CI: 1.02-1.38), higher gravidity (RR=1.43, 95% CI: 1.15-1.79), a history of adverse pregnancy outcomes (RR=1.35, 95% CI: 1.20-1.52), pre-pregnancy hypertension (RR=2.38, 95% CI: 1.67-3.40), pre-pregnancy heart disease (RR=1.41, 95% CI: 1.04-1.91), pre-pregnancy blood disease (RR=2.27, 95% CI: 1.76-2.92), pre-pregnancy antiphospholipid syndrome (RR=2.93, 95% CI: 1.50-5.70), pre-pregnancy tuberculosis (RR=3.37, 95% CI: 2.21-5.15), pre-pregnancy syphilis (RR=1.91, 95% CI: 1.11-3.29), systemic infection in early pregnancy (RR=1.994, 95% CI: 1.107-3.592), not taking folic acid in the 3 months before pregnancy or in early pregnancy (RR=1.81, 95% CI: 1.52-2.16), alcohol drinking in the 3 months before pregnancy (RR=2.20, 95% CI: 1.68-2.88), alcohol drinking in early pregnancy (RR=2.20, 95% CI: 1.68-2.88), house decoration in the 3 months before pregnancy or in early 
pregnancy (RR=1.29, 95% CI: 1.07-1.56), a pre-pregnancy BMI of <18.5 kg/m² (RR=1.74, 95% CI: 1.54-1.96), gestational weight gain of less than 10 kg (RR=3.12, 95% CI: 2.82-3.45), malnutrition (RR=4.43, 95% CI: 3.45-5.69), unbalanced diet (RR=1.27, 95% CI: 1.12-1.45), moderate/severe physical burden during pregnancy (RR=1.19, 95% CI: 1.07-1.33), and pregnancy complicated by anemia (RR=1.36, 95% CI: 1.16-1.59), preeclampsia (RR=21.38, 95% CI: 17.52-26.09), intrahepatic cholestasis of pregnancy (RR=1.50, 95% CI: 1.13-1.99), placental abruption (RR=5.38, 95% CI: 3.48-8.32), or premature rupture of membranes (RR=1.78, 95% CI: 1.44-2.195) were associated with an increased risk of LBW in the offspring. In addition, abnormalities in several blood test indices were also associated with an increased risk of LBW, including white blood cell count (RR=1.48, 95% CI: 1.31-1.68), red blood cell count (RR=1.54, 95% CI: 1.38-1.72), haematocrit (RR=1.36, 95% CI: 1.20-1.54), neutrophil count (RR=1.64, 95% CI: 1.43-1.89), percentage of monocytes (RR=1.63, 95% CI: 1.39-1.92), prothrombin time (RR=2.13, 95% CI: 1.66-2.73), albumin-globulin ratio (RR=1.33, 95% CI: 1.20-1.47), total bilirubin concentration (RR=1.95, 95% CI: 1.22-3.10), total bile acid concentration (RR=1.56, 95% CI: 1.30-1.88), glutamic oxaloacetic transaminase concentration (RR=1.38, 95% CI: 1.22-1.56), urea concentration (RR=1.27, 95% CI: 1.15-1.41), creatinine concentration (RR=3.44, 95% CI: 2.48-4.76), uric acid concentration (RR=1.51, 95% CI: 1.32-1.72), potassium concentration (RR=1.49, 95% CI: 1.29-1.72), chloride concentration (RR=1.72, 95% CI: 1.22-2.43), and C-reactive protein concentration (RR=1.77, 95% CI: 1.59-1.97). (3) Multivariable Poisson regression analysis showed that maternal characteristics including an education level of high school or secondary school (RR=0.57, 95% CI: 0.49-0.65), college (RR=0.42, 95% CI: 0.36-0.48), or bachelor’s degree or higher (RR=0.56, 95% CI: 0.48-0.66), respiratory infections in early 
pregnancy (RR=0.43, 95% CI: 0.23-0.80), a pre-pregnancy BMI of 24-27.9 kg/m² (RR=0.472, 95% CI: 0.377-0.592), a pre-pregnancy BMI of ≥28 kg/m² (RR=0.27, 95% CI: 0.20-0.36), and gestational weight gain of more than 20 kg (RR=0.76, 95% CI: 0.65-0.89) were associated with a decreased risk of LBW in the offspring. In addition, abnormalities in several blood test indices were also associated with a decreased risk of LBW, including platelet specific volume (RR=0.87, 95% CI: 0.79-0.96), fibrinogen concentration (RR=0.82, 95% CI: 0.74-0.91), plasma fibrinogen degradation products (RR=0.79, 95% CI: 0.70-0.90), blood D-dimer concentration (RR=0.72, 95% CI: 0.64-0.81), and triglyceride concentration (RR=0.42, 95% CI: 0.34-0.51). (4) On the internal test set, the prediction accuracy, sensitivity, specificity, PPV, NPV, and AUC were 83.4%, 41.4%, 87.4%, 24.0%, 93.9%, and 0.644 for the logistic regression model; 72.5%, 47.0%, 74.9%, 15.3%, 93.6%, and 0.610 for the C4.5 decision tree model; 81.8%, 44.8%, 85.4%, 22.8%, 94.1%, and 0.651 for the naive Bayesian discrimination model; 94.6%, 84.5%, 95.5%, 64.5%, 98.6%, and 0.904 for the k-nearest neighbor model; 94.1%, 86.0%, 94.8%, 61.6%, 98.6%, and 0.904 for the support vector machine model; 91.2%, 83.2%, 91.9%, 49.8%, 98.3%, and 0.876 for the random forest model; and 93.9%, 82.2%, 95.1%, 61.5%, 98.2%, and 0.886 for the XGBoost model. (5) On the external validation set, the prediction accuracy, sensitivity, specificity, PPV, NPV, and AUC were 65.5%, 46.9%, 67.2%, 10.7%, 93.8%, and 0.571 for the logistic regression model; 57.4%, 52.8%, 57.8%, 9.4%, 93.6%, and 0.553 for the C4.5 decision tree model; 64.5%, 49.3%, 65.8%, 10.7%, 93.9%, and 0.575 for the naive Bayesian discrimination model; 81.9%, 83.3%, 81.8%, 27.7%, 98.3%, and 0.826 for the k-nearest neighbor model; 75.0%, 86.6%, 74.1%, 21.6%, 98.5%, and 0.904 for the support vector machine model; 72.9%, 84.3%, 71.9%, 20.1%, 98.2%, and 0.781 for the random forest model; and 74.0%, 84.0%, 73.1%, 20.7%, 98.2%, and 0.796 for the XGBoost model. Conclusions (1) Among 
the environmental factors, maternal variables including socio-demographic characteristics (age, residence, education level), previous pregnancy history (gravidity, history of adverse pregnancy outcomes), pre-pregnancy health status (hypertension, heart disease, blood disease, antiphospholipid syndrome, tuberculosis, syphilis), folic acid use during the peri-conception period, behavioral habits in the 3 months before pregnancy or in early pregnancy (alcohol drinking in the 3 months before pregnancy, alcohol drinking in early pregnancy), house decoration during the peri-conception period, nutritional status during the peri-conception period (pre-pregnancy BMI, gestational weight gain, malnutrition, unbalanced diet, level of physical activity), and pregnancy-related comorbidities or complications (anemia, preeclampsia, intrahepatic cholestasis of pregnancy, placental abruption, premature rupture of membranes) are associated with LBW in the offspring. (2) Abnormalities in multiple blood test indices in early pregnancy are associated with the risk of LBW in the offspring, including white blood cell count, red blood cell count, haematocrit, neutrophil count, percentage of monocytes, platelet specific volume, prothrombin time, fibrinogen concentration, plasma fibrinogen degradation products, blood D-dimer concentration, albumin-globulin ratio, total bilirubin concentration, total bile acid concentration, glutamic oxaloacetic transaminase concentration, urea concentration, creatinine concentration, uric acid concentration, triglyceride concentration, potassium concentration, chloride concentration, and C-reactive protein concentration. (3) The k-nearest neighbor model is rated as the optimal prediction model for LBW based on the combined results of internal testing and external validation of the seven models. Its prediction accuracy, sensitivity, specificity, PPV, NPV, and AUC are 94.6%, 84.5%, 95.5%, 64.5%, 98.6%, and 0.904, respectively, in the internal test set, and 81.9%, 83.3%, 81.8%, 27.7%, 98.3%, and 0.826, respectively, in the external validation set. |
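
The drop in PPV for the k-nearest neighbor model between internal testing (64.5%) and external validation (27.7%) follows largely from the lower specificity and the low prevalence of LBW, since PPV and NPV depend on prevalence as well as on sensitivity and specificity. The sketch below is illustrative only and is not the study's code: the function name `predictive_values` is hypothetical, and it assumes that the LBW prevalence in each evaluation set equals the incidence reported for the corresponding cohort (8.9% and 7.7%), which the abstract does not state for the 30% internal test split.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) via Bayes' theorem, with all inputs as proportions."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# kNN sensitivity and specificity as reported in the Results.
# Prevalence is ASSUMED equal to the reported cohort incidence of LBW.
ppv_int, npv_int = predictive_values(0.845, 0.955, 0.089)  # internal test set
ppv_ext, npv_ext = predictive_values(0.833, 0.818, 0.077)  # external validation set

print(f"internal: PPV={ppv_int:.3f}, NPV={npv_int:.3f}")
print(f"external: PPV={ppv_ext:.3f}, NPV={npv_ext:.3f}")
```

Under these assumptions the computed values fall within about half a percentage point of the reported PPV and NPV, which suggests the reported metrics are internally consistent; small differences are expected from rounding and from the actual class proportions in each evaluation set.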