Epidemiology studies the distribution and determinants of disease and health at the population level, together with measures for disease prevention and health promotion. Exploring risk factors and inferring etiology is its enduring theme. In modern times, however, classical epidemiology has come to be called "black box" epidemiology. The "black box" methodology (Appendix Fig.1A) identifies risk factors of disease but fails to explain the pathogenic pathways underlying disease occurrence, development and prognosis. Although "black box" epidemiology has contributed greatly to identifying risk factors and controlling disease, it is difficult to predict and assess the effect of an intervention when these pathogenic pathways are unknown; its conclusions may be neither convincing nor reproducible, and the approach has therefore drawn much criticism. For a long time, epidemiologists have sought opportunities to unlock the "black box" and clarify the pathogenic pathways or networks.

In the past few years, high-throughput omics platform technologies such as genomics, epigenomics, transcriptomics, proteomics and metabonomics have made it possible to map the omics biomarkers affecting disease occurrence, development and prognosis onto a molecular network along the continuum DNA→RNA→Protein→Metabolism→Disease phenotype, forming the framework of Integrated Systems Biology. As the cost of high-throughput omics laboratory testing fell sharply, epidemiologists adopted classical methods (e.g. cohort study, case-control study, etc.) to collect various exposure factors (e.g. living habits, dietary pattern and environmental pollution, etc.)
and further examined and analyzed high-throughput omics biomarkers along pathways such as Genome→Phenome, Genome→QTL, Genome→PQTL, Genome→Metabolome (mGWAS) and Epigenome→Metabolome (mEWAS). Coupling these classical methods with modern high-throughput omics platform technology generated a new branch of epidemiology, systems epidemiology. We therefore propose the disciplinary definition and design framework of systems epidemiology (Appendix 1, Fig.1): systems epidemiology couples modern high-throughput omics platform technology with classical epidemiological methodology to measure the genome, epigenome, transcriptome, proteome, metabolome or phenome on the pathway from exposure to disease endpoint; combines this with information from the biological database KEGG (http://www.genome.jp/kegg/) to construct the pathogenic pathway "exposure factors→omics biomarkers→disease endpoint"; and tests for statistically significant differences between networks in different states (e.g. exposed group vs. unexposed group, case group vs. control group), in order to infer the pathogenic pathways affecting disease occurrence, development and prognosis and to estimate their causal effects. This provides a scientific basis for subsequent functional verification and for determining drug targets and prevention or treatment measures.

To infer the path-specific effect of "exposure factors→omics biomarkers→disease endpoint", a series of problems in study design and data analysis must be solved.

(1) From the perspective of study design, although systems epidemiology can still draw on classical epidemiological designs (e.g. case-control design, cohort design, etc.), complex regulatory relations exist along the pathway "exposure factors→omics biomarkers→disease endpoint", which makes the identification and calculation of path-specific effects extremely difficult.
We need to explore approaches to these problems in terms of causal inference theory.

(2) At the level of omics biomarkers, although classical statistical methods (e.g. t test, chi-square test, regression model, etc.) can be used to screen biomarkers associated with disease, these methods are essentially association analyses, not causal analyses. A pathway or network "exposure factors→omics biomarkers→disease endpoint" built from biomarkers screened in this way cannot reflect the pathogenic pathway, because it mixes causal with non-causal relations. Therefore, biomarkers should be screened within the framework of causal inference before the pathway (or network) "exposure factors→omics biomarkers→disease endpoint" is constructed.

(3) For the identification and calculation of a specific pathway "exposure factors→omics biomarkers→disease endpoint", the complex relations involved require us to remove non-causal relations, identify the true causal ones, and then estimate the causal effect of the specific pathway.

To address these three problems, this paper presents four pieces of work.

In chapter 1, we introduce the causal inference theory proposed by Judea Pearl and summarize its fundamental theory and criteria.

In chapter 2, within the framework of causal inference, we examine how the classical matching and regression strategies of the case-control design commonly used in systems epidemiology behave under a complex network, and provide theoretical support for the use of matching and regression adjustment.

In chapter 3, for biomarker screening in high-dimensional omics data, we propose the MB-based Repeated-fishing strategy (MBRFS), built on the Markov blanket algorithm, to screen biomarkers that are potentially causally related to the disease endpoint, providing potential causal support for constructing the pathway "exposure factors→omics
biomarkers→disease endpoint".

In chapter 4, for the identification and calculation of pathogenic pathways, we use a hydrological analogy: the water at a specific downstream point (e.g. the estuary) of a river (e.g. the Yangtze River) comes only from its conflux (inflowing branches), not from its diffluent (outflowing branches). On this basis we propose the path-specific effect statistic PSEM as a new approach to the identification and calculation of pathogenic pathways in systems epidemiology.

1 Causal diagrams theory (chapter 1)

We first introduce the causal diagram theory proposed by Judea Pearl and summarize the fundamental theory and criteria of causal inference.

(1) Causal diagrams (DAGs) consist of three elements: 1) variables (nodes, vertices); 2) arrows (directed edges, arcs), which represent possible direct causal effects; 3) missing arrows, which encode sharp assumptions about absent direct causal effects. DAGs are non-parametric, i.e. they make no assumption about 1) the distribution of the variables (nodes) in the DAG or 2) the functional form of the direct causal effects (arcs). DAGs are "acyclic" in that they contain no directed cycles ("the future cannot directly or indirectly cause the past").

(2) A path is a sequence of non-intersecting adjacent edges. Note that a path is defined regardless of the direction of the arrows and cannot pass through the same node more than once. There are three types of path: causal path (E→C→D), confounding path (E←C→D) and colliding path (E→C←D). Causal paths and confounding paths are open paths that contribute to the association between variables, whereas a colliding path is blocked and contributes nothing to the association between the two variables it links.
Conditioning on variables on a causal path (mediators) blocks the causal path and leads to over-adjustment bias; conditioning on the confounders on a confounding path yields an unbiased estimate of the causal effect; conditioning on the collider on a colliding path opens this previously blocked path and results in selection bias.

(3) d-separation is the bridge linking statistical association and causal relation. A path P is said to be "d-separated" (or "blocked") by a conditioning set of nodes {Z} if 1) P contains a causal chain X→Z2→Y or a confounding path X←Z3→Y such that the middle node (Z2 or Z3) is in {Z}, or 2) P contains a colliding path X→Z1←Y such that neither the middle node Z1 nor any descendant of Z1 is in {Z}. Otherwise, the path P is said to be "d-connected" (or "unblocked" or "open") given {Z}.

(4) The do-calculus proposed by Judea Pearl contains three rules:
1) Insertion/deletion of observations: $P(y \mid do(x), z, w) = P(y \mid do(x), w)$ if $(Y \perp Z \mid X, W)$ holds in $G_{\overline{X}}$;
2) Action/observation exchange: $P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)$ if $(Y \perp Z \mid X, W)$ holds in $G_{\overline{X}\underline{Z}}$;
3) Insertion/deletion of actions: $P(y \mid do(x), do(z), w) = P(y \mid do(x), w)$ if $(Y \perp Z \mid X, W)$ holds in $G_{\overline{X}\,\overline{Z(W)}}$.
Here $G_{\overline{X}}$ is the graph obtained by deleting from G all arrows pointing to nodes in X, $G_{\underline{X}}$ is the graph obtained by deleting from G all arrows emerging from nodes in X, and Z(W) is the set of Z-nodes that are not ancestors of any W-node in $G_{\overline{X}}$.

(5) The back-door criterion is based mainly on confounding paths. When estimating the causal effect of X on Y, a path that links X and Y and has an arrow pointing into X is called a back-door path. If a set Z blocks all back-door paths and contains no descendant of X, the total causal effect of X on Y can be calculated as $P(y \mid do(x)) = \sum_{z} P(y \mid x, z)\,P(z)$. The front-door criterion, in contrast, is based on causal paths and applies mainly when unobserved confounders are present. A set Z satisfies it relative to (T, Y) under three conditions: 1) Z intercepts all directed paths from T to Y; 2) there is no unblocked back-door path from T to Z; 3) all back-door paths from Z to Y are blocked by T.
If Z satisfies the front-door criterion relative to (T, Y) and P(t, z) > 0, then the causal effect of T on Y is identifiable and is given by $P(y \mid do(t)) = \sum_{z} P(z \mid t) \sum_{t'} P(y \mid t', z)\,P(t')$.

(6) An instrumental variable G is used to estimate the causal effect of X on Y when there are unobserved confounders U. It must satisfy three conditions: 1) $G \perp U$; 2) G is strongly associated with X; 3) $G \perp Y \mid X, U$. When the relationships between G and Y and between G and X are linear, the causal effect of X on Y is estimated by the ratio $r_{GY}/r_{GX}$.

(7) The Markov blanket of a target variable M, MB(M), is defined as a minimal set of variables conditional on which all other variables are probabilistically independent of M.

2 Theory and methodology of matching and regression strategy based on causal inference (chapter 2)

In systems epidemiology studies, inferring the causal relations exposure→omics biomarkers, omics biomarkers→omics biomarkers, omics biomarkers→disease endpoint, etc. is the core of identifying pathogenic pathways and estimating path-specific causal effects. Although complex relations exist among exposure, omics biomarkers and disease endpoint, inference about the relations among nodes of a complex network can be reduced to three key relations: causal path (E→C→D), confounding path (E←H→D) and colliding path (E→C←D). When inferring the causal effect of E on D, conditioning on C in a causal path or a colliding path introduces bias (over-adjustment bias or selection bias, respectively) and distorts the causal relation, whereas conditioning on the confounder in a confounding path removes confounding bias and allows the causal effect to be estimated accurately. However, for any three nodes of a complex network, there are 27 possible relationships in terms of topology.
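The count of 27 can be reproduced with a short enumeration. The sketch below is one reading of that figure, assuming each of the three node pairs (E-C, C-D, E-D) carries either no arrow or an arrow in one of two directions, which gives 3^3 = 27 configurations. The names PAIRS, STATES, edges and is_acyclic are illustrative and are not taken from the original work.

```python
from itertools import product

# The three unordered node pairs among exposure E, covariate C and disease D.
PAIRS = [("E", "C"), ("C", "D"), ("E", "D")]
# Assumed states of each pair: no arrow, arrow a->b, or arrow b->a.
STATES = ("none", "forward", "backward")


def edges(config):
    """Turn one state per pair into a frozenset of directed edges (a, b)."""
    out = set()
    for (a, b), state in zip(PAIRS, config):
        if state == "forward":
            out.add((a, b))
        elif state == "backward":
            out.add((b, a))
    return frozenset(out)


def is_acyclic(edge_set):
    """With one edge per pair, the only directed cycles are the two 3-cycles."""
    cycles = [
        frozenset({("E", "C"), ("C", "D"), ("D", "E")}),
        frozenset({("C", "E"), ("D", "C"), ("E", "D")}),
    ]
    return edge_set not in cycles


all_configs = [edges(c) for c in product(STATES, repeat=len(PAIRS))]
acyclic_configs = [e for e in all_configs if is_acyclic(e)]

# The three key relations named in the text, written with C as the middle node.
key_relations = {
    "causal path E->C->D": frozenset({("E", "C"), ("C", "D")}),
    "confounding path E<-C->D": frozenset({("C", "E"), ("C", "D")}),
    "colliding path E->C<-D": frozenset({("E", "C"), ("D", "C")}),
}

print(len(all_configs), len(acyclic_configs))
for name, relation in key_relations.items():
    print(name, relation in all_configs)
```

Under this reading, two of the 27 configurations are directed cycles and so cannot occur in a DAG, and the three key relations appear among the remaining ones.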
We therefore defined 9 key relations: a) C is a confounding factor for the effect of exposure E on outcome D; b) C is a common parent node of E and D, with no causal effect between E and D; c) C is only an independent cause of D; d) C causes E but does not directly cause D; e) C is a common child node (i.e. collider) of E and D; f) C is only an effect of outcome D; g) C is only an effect of exposure E; h) C is a mediator from E to D; i) C is an instrumental variable (IV) for E and D. The causal effect (β) calculated through the do-calculus and the back-door criterion was taken as the gold standard. We used bias (β1-β2) and precision (SE(β1)) to assess the performance of different matching and adjustment strategies, by both theoretical proof and simulation.

Results:
(1) When C is a confounding factor for exposure E and disease D, matching on it did not noticeably increase precision; the advantage of matching was a large reduction in bias, although the bias was not completely eliminated, so further adjustment for C remains essential in matched case-control designs.
(2) Consider the case where C is related to E or D but is not a confounding factor, that is, C independently causes D, causes E without directly causing D, is a collider of E and D, is an effect of exposure E, or is a mediator on the causal path from E to D.
Arbitrary matching on or adjustment for this class of spurious confounders C leads to unexpected bias.
(3) When C is not a confounding factor but an effect of D, matching on or adjusting for C tends to be unnecessary.
(4) In particular, when C is an instrumental variable, matching on or adjusting for C fails to reduce the bias, because unobserved confounding factors U exist.

Conclusion: In the framework of systems epidemiology, when exploring the causal effect of exposure (E) on disease endpoint (D), we need to take into account how the other factors in the network relate to E and D. According to the theoretical proofs and statistical simulations above, proper use of matching and regression adjustment allows the causal effect of E on D to be estimated accurately and precisely, whereas arbitrary use loses both accuracy and precision.

Innovation: For the case-control design in systems epidemiology, within the framework of causal diagrams, we clarified by theoretical proof and statistical simulation the causal inference criteria for using matching and regression adjustment to handle the other factors in the network.

3 Omics biomarkers screening strategy based on conditional independence criterion (chapter 3)

In systems epidemiology studies, the precondition for identifying "exposure factors→omics biomarkers→disease endpoint" is to screen the biomarkers that are potentially causally related to the disease endpoint. However, classical omics data analysis (GWAS, mGWAS, etc.) commonly relies on statistical tests (t test, chi-square test, logistic regression, etc.) or on machine-learning variable selection (LASSO, SVM, Random Forest, etc.), and these methods are based on association analysis, not causal analysis.
Therefore, for the screening of high-dimensional omics biomarkers, we proposed the MB-based Repeated-fishing strategy (MBRFS), based on the Markov blanket criterion, as a novel biomarker screening approach for subsequently forming the "exposure factors→omics biomarkers→disease endpoint" pathway.

Results:
(1) To improve the power of classical Markov blanket algorithms (KS, GS, IAMB, MMMB, HITON-MB, DASSO-MB, FEPI-MB, etc.) while retaining their benefits, we modified the algorithm in three respects. First, for the original high-dimensional omics data, an initial screening by one-at-a-time statistical tests was performed before the MB algorithm. This not only reduced the computational burden but also identified as many phenotype-associated biomarkers as possible. Second, to reduce the number of empty cells in the stratified contingency table of the G2 test, when selecting a new biomarker into the MB we relaxed the conditional independence test by conditioning only on first-order combinations of biomarkers within the MB.
Finally, the proposed repeated-fishing strategy (MBRFS) was used to maintain the power of the G2 test and thus find more true positive biomarkers.
(2) Three simulation scenarios based on the "gain of function" were designed to study performance under different linkage disequilibrium (LD) patterns: 1) in the first scenario, we generated 8 independent phenotype-related SNPs via a logistic regression model and randomly inserted them into 8 different chromosomes, one per chromosome; 2) in the second, we generated 8 correlated phenotype-related SNPs via a logistic regression model, with a pairwise correlation coefficient of 0.1, and randomly inserted them into 8 different chromosomes, one per chromosome; 3) in the third, we randomly selected 8 phenotype-related SNPs, one from each of 8 different chromosomes, with various minor allele frequencies (MAFs) and LD patterns.
(3) A series of simulation results showed that the true discovery rate (TDR) of MBRFS was close to zero under the null hypothesis (odds ratio = 1 for each SNP), indicating excellent stability in all three scenarios: independent phenotype-related SNPs without LD among them, correlated phenotype-related SNPs without LD among them, and phenotype-related SNPs with strong LD among them. As expected, across different odds ratios (ORs) and MAFs, MBRFS performed better in detecting true positive biomarkers, with a higher Matthews correlation coefficient (MCC), in all three scenarios.
More importantly, because MBRFS uses the repeated-fishing strategy, it could capture more phenotype-related SNPs with minor effects, i.e. SNPs that were non-significant under the G2 test after Bonferroni correction for multiple testing.
(4) In the real data analyses, namely the original GWAS data (491,883 SNPs) of leprosy (706 cases and 514 controls), DNA methylation and gene expression data of breast cancer, and a schizophrenia metabolomics dataset, MBRFS performed better than the other methods.

Conclusion: MBRFS could accurately detect true positive biomarkers potentially causally associated with disease, regardless of the degree of correlation between biomarkers and phenotype.

Innovation: Based on the Markov blanket conditional independence criterion of causal diagram theory, our algorithm MBRFS can efficiently identify omics biomarkers potentially causally associated with the disease endpoint, for subsequent construction of "exposure factors→omics biomarkers→disease endpoint".

4 Identification and calculation of pathogenic pathway effect based on do-calculus (chapter 4)

The core of systems epidemiology is to clarify the pathogenic pathways from exposure to disease endpoint and to estimate their causal effects. For the identification and calculation of pathogenic pathways, by analogy with hydrology, where the water at a specific downstream point (e.g. the estuary) of a river (e.g. the Yangtze River) comes only from its conflux rather than its diffluent, we proposed the path-specific effect statistic PSEM as a new approach to the identification and calculation of pathogenic pathways in systems epidemiology.

Results:
(1) For a specific path in a complex network, we proposed 5 rules to simplify the network and extract the specific path from it. It is well known that the water at a specific downstream point (e.g. the estuary) of a river (e.g. the Yangtze River) comes only from its conflux rather than its diffluent.
Graphically, for any two adjacent upstream and downstream points there are 5 kinds of pathways: 1) a single conflux path, for which adjustment is unnecessary; 2) a single diffluent path, for which adjustment is also unnecessary; 3) a colliding path formed by two diffluents, which must not be adjusted for; 4) a confounding path formed by two confluxes, for which adjustment yields an unbiased estimate of the causal effect; and 5) a mediator path formed by a diffluent and a conflux, which must not be adjusted for either.
(2) To identify the path-specific effect (E→M1→M2→M3→…→D), we proposed the segmented series multiplication effect $PSE = \prod_{i=1}^{k} AR_i = AR_1 \cdot AR_2 \cdots AR_k$, which efficiently resolves the non-identifiability of path-specific effects in a complex network.
(3) We defined the statistic StatisticPSE to identify and test the path-specific effect. It not only identifies a pathogenic path by hypothesis testing but also estimates the path-specific effect. To compare the effects of multiple pathways, we also defined the standardized statistics ARsPSE and RRsPSE, completing the identification and calculation of pathogenic pathways.
(4) Statistical simulation showed that these statistics, combined with a permutation test, performed better than other methods and were able to identify and calculate path-specific effects.

Based on the Bogalusa cohort study data on cardiovascular disease and epigenetic biomarkers, we analyzed the process by which smoking affects the insulin signaling pathway and in turn causes diabetes.
We successfully identified the most significant pathway: Smoking→SOCS→INSR→IRS→PI3K→FOXO1→G6PC→Glucose.

Conclusion: The series of statistics we proposed could validly identify and accurately estimate path-specific effects based on the do-calculus, and could be used to compare distinct pathways.

Innovation: Based on the do-calculus proposed by Judea Pearl, and by analogy with hydrology, where the water at a specific downstream point (e.g. the estuary) of a river (e.g. the Yangtze River) comes only from its conflux rather than its diffluent, we proposed the path-specific effect statistic StatisticPSE as a new approach to the identification and calculation of the pathogenic pathways "exposure factors→omics biomarkers→disease endpoint" in systems epidemiology.
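The segmented series multiplication described in the chapter 4 results can be illustrated numerically. The following is a minimal sketch, assuming each segment effect AR_i along a path E→M1→…→D has already been estimated and is supplied as a plain number. The function name and the example values are hypothetical; how each AR_i is estimated and how the permutation test is carried out are not shown here.

```python
from math import prod


def path_specific_effect(segment_effects):
    """Multiply per-segment effects AR_1..AR_k along one path (hypothetical helper)."""
    effects = list(segment_effects)
    if not effects:
        raise ValueError("a path needs at least one segment")
    return prod(effects)


# Hypothetical segment effects for E -> M1 -> M2 -> D (illustrative values only).
example_path = [0.5, 0.4, 0.2]
pse = path_specific_effect(example_path)

# A segment with zero effect breaks the path, so the whole product is zero.
broken_path = [0.5, 0.0, 0.2]
pse_broken = path_specific_effect(broken_path)

print(round(pse, 6), pse_broken)
```

Because the effect is a product, a single segment with an effect near zero drives the whole path-specific effect toward zero, which matches the intuition that a path transmits an effect only if every segment along it does.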