Research On Network Structure-Driven Models For Biomarker Selection And Disease Prediction

Posted on: 2017-05-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X S Zhang
GTID: 1224330485479614
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Epidemiology studies the distribution and risk factors of disease and health status at the population level, and proposes strategies for disease prevention and health promotion. Selecting risk factors (such as biomarkers) provides a foundation for identifying the causes of disease and for predicting its occurrence and prognosis. Accurate risk factor selection and disease outcome prediction are therefore key to disease prevention. Most diseases are caused by the interaction of genetic and environmental factors (personal living habits, physiological and psychological factors, environmental pollution, etc.). These factors often act together within a network system that regulates disease occurrence, development and prognosis. Hence, network structure should be considered in both risk factor selection and disease prediction in order to extract the information carried by the relationships between variables.

Regression-based methods are commonly applied in risk factor selection and disease prediction. Regression equations typically involve simplifying assumptions, such as linearity and additivity. Although nonlinear regression such as regression splines allows the additive effects to be nonlinear, it still works within an additive framework. The assumptions of regression models are a double-edged sword: they simplify the analysis, and effects can be interpreted explicitly through the regression coefficients; but they ignore the network structure present in the data and reflect only the linear effects of isolated predictors, which is bound to lose important information. Interaction between predictors can be represented in a regression model by product terms (for two-variable interactions, multi-level interactions, or complex interactions embedded in a pathway), but it is difficult to reveal the whole network this way.
Moreover, collinearity and the dimension of the model increase dramatically when the number of interaction terms is large, resulting in severe estimation bias. Hence, thinking globally from the perspective of the whole network and acting locally within a specific pathway to estimate effects on disease is a promising direction for epidemiologic inference and disease prediction. In this dissertation, we focused on risk factor or biomarker selection and disease prediction within the framework of network structure.

With the development of high-throughput techniques and the decrease in their cost, a variety of biological markers (genome, transcriptome, epigenome, proteome, and metabolome) can now be measured, which enables epidemiologists to obtain both exposures and biomarkers at the population level. These large amounts of data provide a basis for network-driven analysis. Therefore, in this dissertation, we applied network-driven methods in three typical contexts for risk factor selection and disease prediction: 1) a gene interaction network structure-based Bayesian variable selection model for biomarker selection (Chapter 2); 2) Bayesian network-based models for disease screening (Chapter 3); 3) a Bayesian network and competing risk-based cause-specific model for disease risk prediction (Chapter 4).

1. Gene network structure-driven Bayesian biomarker selection model (Chapter 2)

Two analysis strategies are commonly used in genome-wide association studies (GWAS): statistical inference and variable selection. Statistical inference uses statistical tests to identify causal markers by P value, which is obtained for each SNP from a test statistic comparing two groups, such as cases and controls. A SNP is considered potentially causal if its P value is smaller than a pre-defined significance level α. Variable selection methods such as LASSO regression and ridge regression determine a subset of SNPs that gives the best-fitting model.
These two strategies both ignore the structure of the gene interaction network, which inevitably loses the information shared between SNPs. Therefore, we proposed a gene network structure-driven biomarker (SNP) selection model, in which a gene network layer was introduced between the SNPs and the disease phenotype. A Bayesian hierarchical modeling strategy was employed to incorporate the gene network topology as a prior. This method is especially suitable for data from genome-wide exome chips, which consist of gene-based SNP genotypes from the exonic regions of genes.

The proposed gene network structure-driven Bayesian biomarker selection analysis included the following steps:
1) The gene network topology was obtained from the KEGG database (http://www.kegg.jp/), in which the biological network structures are confirmed by biological experiments. We constructed an adjacency matrix $R$ to encode the gene network structure ($R_{ij}=1$ if the $i$th and $j$th genes are linked, otherwise $R_{ij}=0$).
2) The Bayesian hierarchical model was constructed as $Z_i = (T_{(\xi,\gamma)}\beta_{(\xi,\gamma)})_i + \varepsilon_i$, $\varepsilon_i \sim N(0,1)$, where $Z$ is the latent score of the phenotype, $T_{(\xi,\gamma)}$ denotes the gene scores, $\beta_{(\xi,\gamma)}$ is the gene effect on the phenotype, and the two indicator vectors $\xi=(\xi_1,\dots,\xi_J)$ and $\gamma=(\gamma_1,\dots,\gamma_P)$ indicate whether the corresponding genes and SNPs, respectively, are included in the model.
3) The gene selection indicator $\xi=(\xi_1,\dots,\xi_J)$ was assumed to follow a Markov random field (MRF) distribution, and the SNP selection indicators were assumed to follow Bernoulli distributions.
4) The posterior distribution of the model parameters was obtained by Markov chain Monte Carlo (MCMC) sampling after theoretical derivation.
5) We ranked the SNPs by posterior probability, and the top-ranked SNPs were added to the model one by one.
The best model was determined using ten-fold cross-validation (CV), and the SNPs in the best model were taken as the potential causal SNPs for the disease or phenotype.

Results:
(1) Three simulated datasets were generated under different scenarios, indexed by the proportion of the variance of the latent phenotype variable Z explained by SNPs: GV30 (30%), GV50 (50%) and GV70 (70%). The simulation results were as follows:
1) For identifying causal SNPs, the AUC of the proposed model (denoted ND-BVS) increased with the explained variance (0.792 for GV30, 0.894 for GV50, 0.911 for GV70). It outperformed LASSO (0.779 for GV30, 0.882 for GV50, 0.891 for GV70) and the stepwise method (0.774 for GV30, 0.853 for GV50, 0.869 for GV70).
2) For predicting phenotypes, the proposed ND-BVS model was also superior to the LASSO and stepwise approaches.
(2) The three methods were applied to leprosy GWAS data with 706 cases and 514 controls. Of 492,109 SNPs, 3,388 were selected to enter the ND-BVS model after single-locus logistic model screening at significance level 0.0001. The ND-BVS method selected 94 SNPs, while LASSO and stepwise selected 100 and 3 SNPs respectively.
Among the positive SNPs identified by ND-BVS, five were externally validated in the two-stage GWAS of leprosy, compared with three for LASSO and one for stepwise.

Conclusion: By embedding gene-gene network structure in the model, the proposed ND-BVS model improved both the identification of causal SNPs and the prediction of the disease phenotype compared with the traditional stepwise method and LASSO.

Innovation: A gene network structure-driven Bayesian biomarker selection model was developed within the framework of Bayesian hierarchical modeling, providing a novel method especially suited to genome-wide gene-based exome analysis.

2. Network structure-driven model for disease screening

Disease screening is a preventive measure that discriminates between healthy and unhealthy people and makes inferences about a disease or defect that has not yet been identified. Disease- or phenotype-related risk factors are usually acquired from a cross-sectional survey and include living habits, physical measurements, biochemical markers, serological markers and genetic markers. Statistical disease screening models use the risk factors as input variables and the disease or phenotype as the output variable. Regression models such as logistic regression, in the form of a linear additive equation, are still preferred, but they reflect only the linear and additive effects of the predictors. Although interaction between predictors can be considered by adding product terms to the regression model, it is difficult to represent the whole network with interaction terms. The regression modeling strategy becomes ineffective with a large number of predictors that have complex relationships. Machine learning algorithms such as neural networks have been developed to reveal complex nonlinear relationships between the predictors and the outcome, which improves prediction accuracy to some extent.
However, this approach does not overcome the limitation of regression models, which ignore the regulatory relationships between variables, and it is prone to overfitting. Hence, we proposed a Bayesian network-based disease screening model built on the conditional independence criterion, which extracts the regulatory relationships between variables and improves the screening ability of the model.

Bayesian networks describe the independence and dependence relationships between variables through the topology of a directed acyclic graph (DAG). The nodes represent the variables, and the edges indicate direct dependencies between them. Learning a Bayesian network involves two steps: structure learning and parameter learning. The network structure is learned by combining prior knowledge about the biological network with a structure learning algorithm. The parameters (i.e., conditional probabilities) are estimated by maximum likelihood given the learned structure. Simulation studies were conducted to evaluate the validity of the proposed network structure-driven disease screening model. The AUC corrected by 10-fold cross-validation (AUC-CV) was used to assess the discriminatory ability of the Bayesian network-based model compared with logistic regression and a neural network.
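The cross-validated AUC mentioned above can be illustrated with a minimal sketch. It assumes the AUC is computed from pooled out-of-fold predictions, which is one common convention and not necessarily the procedure used in the dissertation. The names `auc`, `cv_auc`, `fit` and `predict` are hypothetical and introduced only for illustration.

```python
import random

def auc(labels, scores):
    """Mann-Whitney estimate of the AUC: the probability that a case
    scores higher than a control, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cv_auc(X, y, fit, predict, k=10, seed=0):
    """k-fold cross-validated AUC from pooled out-of-fold predictions.

    fit(X_train, y_train) returns a model; predict(model, x) returns a
    risk score for one subject. Both are supplied by the caller.
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    out_of_fold = [None] * len(y)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            # each subject is scored by a model that never saw it
            out_of_fold[i] = predict(model, X[i])
    return auc(y, out_of_fold)
```

Because every subject is scored by a model fitted without that subject, the resulting AUC is not inflated by overfitting in the way an apparent (resubstitution) AUC can be.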
The simulation was conducted as follows:
1) Data under two kinds of null hypothesis (all variables generated independently, or predictors with a network structure but not associated with the disease) were generated to evaluate whether the AUC of the models is close to 0.5.
2) Different network structures (regular network, wheel network and chain network) were generated using a Bayesian network algorithm to assess whether regression-based methods (logistic regression, neural network) lose power by ignoring the network structure.
3) Data without network structure were generated using a logistic model to explore whether the Bayesian network is robust when the data satisfy the linearity and additivity assumptions.

Results:
1. Simulation results showed:
1) Under the null hypotheses (all variables generated independently, or predictors with a network structure but not associated with the disease), the AUC without cross-validation of all three methods was far from 0.5, with the neural network deviating the most, followed by the Bayesian network and logistic regression. The AUC approached 0.5 as the sample size increased.
In contrast, the AUC-CV of all the methods was close to 0.5 when the sample size exceeded 500, illustrating that AUC-CV is a convincing indicator of discriminatory performance.
2) The discriminatory ability (AUC-CV) of all three methods varied only slightly with sample size, while the stability of the models was sensitive to sample size; we therefore recommend a larger sample size (more than 500) for constructing a disease screening model.
3) When the data had a regular network structure generated by the Bayesian network algorithm, the Bayesian network had the best discriminatory performance (AUC = 0.72 at sample size 500, for instance), while logistic regression and the neural network reached 0.60 and 0.62 respectively, indicating that a disease screening model that ignores the network structure inevitably loses information.
4) When the data had a chain network structure, the Bayesian network performed similarly (AUC-CV = 0.66) to the neural network (AUC-CV = 0.63), but logistic regression almost lost its power (AUC-CV = 0.56).
5) When the data had a wheel network structure, the Bayesian network was equivalent to logistic regression, and all three methods had similar AUC-CV (AUC-CV = 0.65). This showed that the Bayesian network performed comparably to logistic regression when the variables were independently associated with the disease phenotype.
6) When the data were generated using a logistic model, all three methods had nearly the same discriminatory ability (AUC-CV = 0.8). This further confirmed that the Bayesian network performed similarly to logistic regression when the variables were independently associated with the disease phenotype.
2. In the real data analysis, the Bayesian network, logistic regression and the neural network were each used to construct a leprosy screening model from 16 verified causal SNPs, with 706 cases and 514 controls. After ten-fold cross-validation, the Bayesian network outperformed the other two methods, with an AUC-CV of 0.7152.
The AUC-CV of logistic regression and the neural network were 0.6976 and 0.6794 respectively.

Conclusion: Disease screening models that ignore the network structure between the predictors and the disease phenotype inevitably lose some discriminatory ability, and their discrimination can be improved by incorporating the network information. The Bayesian network has discriminatory ability equivalent to that of the logistic regression model when no network structure exists between the variables.

Innovation: We proposed a new disease screening modeling strategy that incorporates network structure information between the variables, and showed that models ignoring the network structure are bound to lose discriminatory ability. The proposed model provides a novel way to enhance the discriminatory ability of disease screening models.

3. Network structure-driven model for disease risk prediction

The basic task of disease risk prediction is to predict, before the disease occurs, an individual's absolute risk of developing the disease of interest during a given time period. Absolute risk is defined as the probability that a person with a given set of risk factors who is free of the disease of interest at age $a$ will develop the disease before the subsequent age $a+\tau$, where $\tau$ is the length of the interval over which risk is projected. It is not uncommon for a subject in a study to be at risk of more than one type of event, where the occurrence of one event prevents the others from happening; this is called competing risk. For instance, in predicting the individualized absolute risk of stroke, if an individual dies of lung cancer during follow-up, the probability of that individual subsequently developing stroke is zero. Competing risk should therefore be considered when predicting the absolute risk of a specific event, in order to improve prediction accuracy.
Hence, disease risk prediction models are commonly built using proportional cause-specific hazard models or proportional sub-distribution hazard models based on competing risks. Cause-specific hazard models are flexible and can be used with different study designs, such as cohort and case-control studies. The steps for constructing a cause-specific model are as follows.

Suppose there are $N$ subjects in the study population and $n$ of them develop the disease of interest in a specific time period, giving $n$ cases and $N-n$ non-cases. Let $X_i' = (X_{i1}, X_{i2}, \dots, X_{iP})$ denote the predictor vector of the $i$th subject. The absolute risk that a person free of the disease of interest at age $a$ will develop the disease before the subsequent age $a+\tau$ is then expressed in terms of the following quantities: subscript 1 denotes the disease of interest (such as stroke), subscript 2 denotes the competing events (such as non-stroke death), $\lambda_{10}(t)$ is the baseline hazard for subjects at age $t$, and $rr_1(t|X)$ is the relative risk at age $t$ for a subject with covariates $X$ compared with a subject at the reference level. $rr_1(t|X)$ varies with age even if the risk factors $X$ are constant, since age interacts with the risk factors. Furthermore, $rr_1(t|X)$ necessarily changes if $X(t)$ varies with age $t$. In practice, $rr_1(t|X)$ is commonly assumed to be constant within a specific time period, and the competing risk is assumed to be independent of $X(t)$. Thus $rr_1(t|X)$ can be estimated by Cox regression or logistic regression. $\lambda_{10}(t)$ is estimated by $\lambda_{10}(t) = [1-AR(t)]\,\lambda_{10}^{*}(t)$, where $AR(t)$ here denotes the population attributable risk at age $t$ and $\lambda_{10}^{*}(t)$ can be estimated from the age-specific incidence rate of the disease of interest. In the regression-based disease risk prediction model, a multivariate logistic regression model is fitted between the $p$ risk factors and the disease. The relative risk of the $i$th subject is calculated as $rr_1 = \prod_{j=1}^{p} r_j$, where $r_j$ ($j=1,2,\dots,p$) is the odds ratio for the $j$th risk factor.
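For reference, a standard cause-specific formulation of the absolute risk that is consistent with the symbols defined above can be written as follows. This is the usual textbook form, offered as a sketch and not necessarily the exact expression used in the dissertation. The notation $P(a,\tau \mid X)$ for the absolute risk and the competing-event hazard $\lambda_2(t)$ are assumptions introduced here, with $\lambda_2$ taken to be independent of $X$ as stated above.

```latex
\[
P(a,\tau \mid X) \;=\; \int_{a}^{a+\tau} \lambda_{10}(t)\, rr_1(t \mid X)\,
\exp\!\left\{ -\int_{a}^{t} \bigl[\lambda_{10}(u)\, rr_1(u \mid X) + \lambda_2(u)\bigr]\, du \right\} dt
\]

% Special case: hazards constant over [a, a+\tau], with
% h_1 = \lambda_{10}\, rr_1 and h_2 = \lambda_2
\[
P(a,\tau \mid X) \;=\; \frac{h_1}{h_1 + h_2}\,\Bigl(1 - e^{-(h_1 + h_2)\,\tau}\Bigr)
\]
```

The exponential term is the probability of remaining free of both the disease of interest and the competing event up to age $t$, which is how the competing risk lowers the absolute risk of the disease of interest.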
The above relative risk is estimated by directly combining the independent linear effects of the individual risk factors, which ignores the network interaction information between the predictors and is therefore bound to lose predictive power.

Hence, we proposed a network structure-driven strategy for disease risk prediction. We first constructed a Bayesian network between the predictors and the disease phenotype, and then defined the individual relative risk under the Bayesian network for a subject by comparing the subject's predictor levels with a baseline reference, where $X_{01}, X_{02}, \dots, X_{0N}$ denote the baseline reference levels of the risk factors and $X_{i1}, X_{i2}, \dots, X_{iN}$ are the risk factor levels of the $i$th subject. Simulation studies were conducted to evaluate the performance of the BN-based model compared with the traditional LRT-based model under different scenarios. The E/O ratio and the AUC were used to evaluate calibration and discrimination respectively. The proposed model was further assessed by building a type 2 diabetes prediction model using the Shandong multi-center health physical examination longitudinal cohort established by our group.

Results:
1. Simulation results showed:
1) Both the E/O ratio and the AUC of the network structure-driven disease risk prediction model (BN-based model) and the logistic regression-based risk prediction model (LRT-based model) varied only slightly when the sample size was larger than 1000. Although the discriminatory ability (AUC) of the two models was quite similar, the E/O ratio of the LRT-based model was clearly higher than 1, indicating calibration inferior to that of the proposed BN-based model.
2) As the effect of the predictors increased, the BN-based and LRT-based methods tended to underestimate and overestimate the absolute risk respectively, but the E/O ratio of the BN-based model was superior to that of the LRT-based model.
The AUC of the two methods increased with the effect of the predictors.
3) The E/O ratios of the BN-based model and the LRT-based model were not affected by the correlation of the predictors, and the BN-based model outperformed the LRT-based model in calibration. The AUC of both methods increased with the correlation of the predictors.
4) Across different time lengths, the BN-based model remained superior to the LRT-based model in calibration. The AUC of the two methods did not vary noticeably with the time length, and the two methods still performed similarly in discrimination.
5) The BN-based model was not sensitive to the cumulative incidence, while the LRT-based model was affected by the cumulative incidence of the disease. The E/O ratio of the LRT-based model worsened as the cumulative incidence increased, while the BN-based model remained stable and superior to the LRT-based model. The two methods still had similar and stable performance in discrimination.
2. In the application analysis, based on the Shandong multi-center health physical examination longitudinal cohort, both BN-based and LRT-based risk prediction models for developing type 2 diabetes were built using data from residents who visited the Shandong State Established Hospital (SSEH) for regular physical examinations (757 of 7381 subjects developed type 2 diabetes during the 5-year follow-up). An external validation cohort from routine physical examinations at the Affiliated Hospital of Jining Medical University (HJMU) was used for external validation of the estimated 5-year absolute risks (233 of 4142 subjects developed type 2 diabetes during the 5-year follow-up). The E/O ratios of the BN-based model and the LRT-based model were 0.93 and 0.89 respectively, indicating that the BN-based model was better calibrated (E/O closer to 1). The two methods had good and similar discriminatory performance, with AUCs of 0.699 and 0.701 respectively.
The application results are consistent with the simulation results.

Conclusion: Both simulation and application showed that, although the network structure-driven disease risk prediction model and the traditional logistic regression-based prediction model differed little in discriminatory ability (AUC), the calibration (E/O ratio) of the BN-based model was better than that of the LRT-based model. This shows that prediction models that ignore network structure inevitably lose predictive accuracy.

Innovation: We proposed a network structure-driven disease risk prediction model that incorporates the network structure between the predictors and the disease phenotype within a competing risk model, which improved predictive accuracy and provides a novel method for disease risk prediction.
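The calibration measure used above can be sketched briefly. The E/O ratio compares the number of events a model expects (the sum of the predicted absolute risks) with the number actually observed. The function name and the toy inputs below are hypothetical and serve only as an illustration; they are not data from the dissertation.

```python
def expected_observed_ratio(predicted_risks, outcomes):
    """E/O ratio: expected events (sum of predicted absolute risks)
    divided by observed events (count of subjects with outcome 1)."""
    if len(predicted_risks) != len(outcomes):
        raise ValueError("one predicted risk is needed per subject")
    observed = sum(outcomes)
    if observed == 0:
        raise ValueError("E/O is undefined when no events are observed")
    expected = sum(predicted_risks)
    return expected / observed

# Toy illustration with made-up values: four subjects, two observed events.
ratio = expected_observed_ratio([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1])
```

A ratio above 1 means the model predicts more events than were observed (overestimation), a ratio below 1 means it predicts fewer (underestimation), and a ratio close to 1 indicates good calibration.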
Keywords/Search Tags:Network structure, Biomarker, Variable selection, Disease screening model, Disease risk prediction model