| Small for gestational age (SGA) infants are most commonly defined as those whose birth weight falls below a specified threshold. The weight of these infants is always below that of normal infants. Moreover, SGA children often face many difficulties during the perinatal period, their school years, and adulthood. Therefore, identifying SGA infants before birth, or even before pregnancy, is critical for helping physicians introduce interventions earlier and improve the overall prognosis of these infants. With the development of computer technologies, machine learning (ML) techniques provide a new tool for prediction tasks in the medical domain. However, ML has not been widely studied for SGA detection. To develop effective SGA prediction models, multiple ML algorithms are studied on an SGA dataset. Data on SGA infants collected from 2010 to 2013, with gestational ages between 24 and 42 weeks, were used in this work. The main content of this dissertation is summarized as follows: 1) Preprocessing of the SGA dataset. Because the original data contain many problems, preprocessing is required before the data can be used. In this work, the diagnostic criteria for SGA infants are selected, the SGA cases and controls are filtered, the variables of the SGA dataset are created, and the missing values are handled. 2) Feature selection for the SGA dataset. To use the SGA features more efficiently, feature selection is conducted to find optimal feature subsets for building the prediction models. An expert-knowledge-based Filter-Wrapper hybrid feature selection method is proposed. This method combines knowledge-driven and data-driven features, and thus considers both expert knowledge and data insight. It is an effective feature selection method that balances computational cost and performance. 3) Building SGA prediction models using ML algorithms. Classic ML algorithms, such as support vector machine (SVM), random
forest (RF), logistic regression (LR), and sparse LR, are applied to build SGA models. A comparative study of the four models is conducted. The results show that, with the help of the expert-knowledge-based Filter-Wrapper hybrid feature selection method, sparse LR obtained the highest AUC (area under the receiver operating characteristic curve) value of 0.8376. 4) Improving SGA prediction models by handling imbalanced data. Because the SGA dataset is imbalanced, the imbalance is addressed in order to improve the SGA prediction models. In this work, a simple undersampling-based Bagging ensemble learning method is proposed to handle the imbalance problem. As a result, more effective SGA prediction models were obtained. The model that handled the imbalance problem and used the expert-knowledge-based Filter-Wrapper hybrid feature selection method with the RF algorithm obtained the best performance, with an AUC value of 0.8547. 5) Building SGA models based on features from the pre-pregnancy, pregnancy, and postpartum periods. Considering the temporal attribute of the SGA dataset, three feature subsets are constructed from the features available in the pre-pregnancy, pregnancy, and postpartum periods. Then, three groups of prediction models are built. The performance of the models shows that those built for the pre-pregnancy and pregnancy stages can predict SGA effectively. SVM obtained the highest AUC values in both stages, 0.8110 and 0.8120, respectively. |
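
The undersampling-based Bagging idea in item 4, together with the AUC metric used throughout, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the nearest-centroid base learner, the function names, the choice of ten ensemble members, and the synthetic data are all assumptions made for illustration (the best model reported above uses RF, not a centroid scorer). The sketch also assumes the minority class is labeled 1 and is no larger than the majority class.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_scorer(X, y):
    """Hypothetical stand-in base learner (not RF): higher score = closer to class 1."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos = centroid([x for x, t in zip(X, y) if t == 1])
    c_neg = centroid([x for x, t in zip(X, y) if t == 0])

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return lambda x: dist2(x, c_neg) - dist2(x, c_pos)


def fit_undersampled_bagging(X, y, n_estimators=10, seed=0):
    """Train each member on all minority cases plus an equal-sized majority sample."""
    rng = random.Random(seed)
    minority = [i for i, t in enumerate(y) if t == 1]
    majority = [i for i, t in enumerate(y) if t == 0]
    members = []
    for _ in range(n_estimators):
        # Undersample the majority class without replacement to balance the subset.
        idx = minority + rng.sample(majority, len(minority))
        members.append(fit_centroid_scorer([X[i] for i in idx], [y[i] for i in idx]))
    # The ensemble score is the mean of the member scores.
    return lambda x: sum(m(x) for m in members) / len(members)


def make_toy_data(n_pos=30, n_neg=300, seed=1):
    """Synthetic 1:10 imbalanced data; purely illustrative, not SGA data."""
    rng = random.Random(seed)
    X = [[rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)] for _ in range(n_pos)]
    X += [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_neg)]
    y = [1] * n_pos + [0] * n_neg
    return X, y


X_train, y_train = make_toy_data(seed=1)
X_test, y_test = make_toy_data(seed=2)
model = fit_undersampled_bagging(X_train, y_train)
test_auc = auc(y_test, [model(x) for x in X_test])
print(f"toy AUC: {test_auc:.4f}")
```

Each ensemble member sees every minority case but a different random slice of the majority class, so the ensemble as a whole still draws on most of the majority data while every individual learner trains on a balanced subset.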