Causal Network Structure Learning And Multiple Mediation Analysis Method Based On GWAS Summary Data

Posted on: 2024-01-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Hou
GTID: 1524306917988649
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background
In observational studies, exploring the pathogenesis and prevention of diseases is important. Discovering causal relationships between variables and estimating causal effects have always been the two central questions in causal inference, and controlling unmeasured confounders remains a difficult problem. Graphical models reveal the generating process of the observed data, but they can be identified only under the causal sufficiency assumption. For example, traditional Bayesian network algorithms, including constraint-based and score-based methods, cannot produce robust causal graphs unless unmeasured confounders are effectively controlled. Even when the causal structure among variables is known, there may be multiple causal pathways between any two variables, and estimating the effects of these pathways is the second concern. Multiple mediation analysis can be used to estimate the mediation pathways in the network, but it requires the sequential ignorability assumption, that is, that there are no unmeasured confounders among the exposure, mediators and outcome. This assumption is difficult to satisfy in applications, and neglecting important confounders may seriously bias causal effect estimates. Therefore, it is a major challenge to learn the network structure among variables from observational data, control known and unknown confounders so as to approximate the true causal network, and then accurately estimate the causal effect of the mediating pathway of interest. Genetic variants can be used as instrumental variables (IVs) to infer the causal relationship between an exposure and an outcome, an approach known as Mendelian randomization (MR). Genome-wide association studies (GWAS) based on large cohorts provide a large amount of publicly available summary data, which supplies MR with rich information and avoids time-consuming and labor-intensive data acquisition such as gene sequencing. MR is used to
control unmeasured confounders and avoid reverse causation, which provides new insight into causal discovery and path-specific effect (PSE) estimation. However, causal discovery based on MR still stays at the level of univariable MR, that is, the total causal effect between two variables. It does not reveal the direct causal relationships among multiple variables in a network; in other words, it ignores the causal role (mediator, confounder or collider) that each variable plays in the network. In addition, MR-based mediation analysis is limited to a single mediator and does not consider PSE estimation with multiple mediators. The crucial problem this study aims to solve is therefore how to avoid the use of high-cost genetic data, make full use of the large amount of publicly available GWAS summary data, control the influence of unknown confounders, learn the causal network structure among multiple variables, and estimate the effects of mediation pathways.

Methods
In this study, based on multi-source GWAS summary databases, a conditional causal network structure learning algorithm, MRSL, is proposed to relax the causal sufficiency assumption. For the multiple mediating pathways connecting any two variables, the sequential ignorability assumption is relaxed and a PSE-MR method is proposed to estimate the effects of multiple mediating causal pathways. The soundness and validity of the proposed methods are demonstrated through theoretical proof, statistical simulation and application analysis.
(1) For the MRSL method, a conditional causal network learning algorithm is constructed at the theoretical level. Its process is as follows: perform bidirectional MR on each pair of variables to determine the causal direction between them → obtain the marginal causal graph → compute the topological sorting of the marginal causal graph using a depth-first search algorithm → define the sufficient separating set for multivariable MR (MVMR) → remove spurious direct edges → iterate the previous step until the causal graph converges → output the conditional causal
graph. A series of lemmas and theorems are proposed based on the graphical model, and theoretical proofs are given to verify the soundness and rationality of the causal network structure learning algorithm. The statistical simulation study included three scenarios: ① based on three types of adjustment variables (mediators, confounders and colliders) in MVMR, the optimal IV selection strategy for multivariable MR was identified by evaluating the bias, accuracy, type I error rate and statistical power of the causal estimates; ② based on the optimal IV selection strategy, the accuracy of network structure learning of MRSL and eight other published methods was evaluated by simulating random and fixed network graphs. The evaluation indexes comprised two parts: network learning (precision, recall, F1 score, structural Hamming distance and computing time) and topological order calculation (relative Spearman’s footrule and Kendall’s tau); ③ a sensitivity analysis of MRSL was carried out in which existing MR methods were nested into MRSL, to verify the robustness of the algorithm by evaluating their performance under invalid IVs.
(2) For the PSE-MR method, at the theoretical level, given the causal network graph among multiple variables and an exposure and outcome of interest, the causal pathway effect estimation model between exposure and outcome is constructed for both causally non-ordered and ordered mediators, under the framework of the nested counterfactual model, combining multivariable MR with the product method and the Monte Carlo method. To verify the soundness and rationality of PSE-MR, theoretical derivations are carried out in three situations: no pleiotropy, balanced/directional pleiotropy, and interaction among exposure, mediators and unknown confounders. The statistical simulation study covers four situations: ① whether pleiotropy exists, ② whether there is bidirectional causality between
exposure and mediators, ③ multiple mediators are misidentified, and ④ a mediating variable is missing. The evaluation indexes cover two levels: ① the accuracy of causal effect estimation, including relative bias and mean square error; ② hypothesis testing of the causal effect, including type I error rate and statistical power, to demonstrate the accuracy and robustness of the method.
(3) To evaluate the MRSL&PSE-MR analysis process systematically and comprehensively, epidemiological survey, whole-genome genotyping and serum metabolite detection data from the early diagnosis and early treatment cohort of esophageal squamous cell carcinoma (ESCC) in areas with a high incidence of esophageal cancer in Shandong Province, China, were used to conduct GWAS of alcohol consumption, serum metabolites and ESCC. GWAS summary data were obtained, and the MRSL&PSE-MR analysis process proposed in this study was then applied to learn the causal network structure among alcohol consumption, serum metabolites and ESCC, and to estimate the causal effect of the metabolic mediating pathways between alcohol consumption and ESCC, so as to clarify the pathogenic role of metabolic mediators between alcohol consumption and ESCC. Finally, KEGG metabolic pathway enrichment analysis was performed to verify the results.

Results
In this study, using multi-source GWAS summary databases, the structure learning algorithm MRSL and the multiple mediation analysis method PSE-MR are proposed to accurately construct the conditional causal network and estimate the causal mediating pathway effects between the exposure of interest and the outcome. The results of theoretical proof, statistical simulation and application analysis all show that the proposed methods are sound, effective and practical.
(1) For MRSL, the theoretical results are as follows: ① for the marginal causal graph, with respect to directed edges, colliders and topological sorting, two
lemmas are proposed and proved using graph theory; ② for the sufficient separating set in MVMR, three strategies are proposed: the set of variables on all paths between X_p and X_q; the union of the minimal adjustment set between X_p and X_q and the mediators; and the set of variables between X_p and X_q excluding colliders; ③ for conditional causal graphs, the core theorem for removing spurious direct edges is proposed: under the causal Markov condition, the faithfulness assumption and the three core assumptions of MR, for each edge X_p → X_q in the marginal causal graph G_M, if there is a sufficient separating set ADJ_{X_p→X_q} such that X_p ⊥ X_q | ADJ_{X_p→X_q}, which can be tested by adjusting for genetic associations with ADJ_{X_p→X_q} using multivariable MR, then there is no direct edge from X_p to X_q in the true causal graph G. The influence of unobserved confounders U can be eliminated by MR. The simulation results show that ① in multivariable MR, adjusting for colliders can bias the causal effect estimate. After comprehensively evaluating the estimation accuracy and testing performance of MVMR when adjusting for confounders and mediators, the optimal IV selection strategy is to select IVs that are strongly associated with at least one variable among the exposure and the adjustment variables. In addition, with this IV selection strategy, the bias caused by adjusting for colliders is minimal. ② Based on this IV selection strategy, the simulation results for random and fixed graphs show that the F1 score, precision, recall, structural Hamming distance and computing time of MRSL with each of the three sufficient separating sets are better than those of the other eight methods. MRSL performs best and most robustly with the third sufficient separating set: its F1 score is twice that of the other methods, and its computing time is only 1/100 of theirs. The results for relative Spearman’s footrule and Kendall’s tau both show strong consistency
between the true and the estimated topological sorting. ③ The sensitivity analysis under invalid IVs shows that when the proportion of invalid IVs is less than 50%, MRSL using inverse variance weighting (IVW) as the main analysis method performs better than MRSL nested with other MR methods that are robust to pleiotropy or weak IVs.
(2) For PSE-MR, the theoretical results show that, for both non-ordered and ordered mediators, under the causal consistency assumption, the component assumption and the three core assumptions of MR, two situations are considered: ① PSE-IVW is proposed when there is no pleiotropy; ② PSE-Egger is proposed when there is pleiotropy. If the number of IVs is greater than the number of mediators, PSE-MR yields unbiased estimates of the total causal effect, direct causal effect, indirect causal effect and the causal effects of the mediation pathways, and the formulas for their variances are derived. When the exposure and mediators interact in their effect on the outcome, indirect effects (product method) can be estimated without bias, whereas total and direct effects need to be estimated by subgroup analysis. When multiple mediators interact in their effect on the outcome, direct effects can be estimated without bias, whereas total and indirect effects require subgroup analysis. When mediators and unknown confounders interact in their effect on the outcome, only direct effects can be estimated without bias. The statistical simulation results show that ① PSE-MR gives unbiased causal effect estimates, small mean square error, a stable type I error rate and high statistical power regardless of whether pleiotropy exists, and the minimum number of IVs needed to reach 80% statistical power is identified for different numbers of mediators. ② When there is bidirectional causality between exposure and mediator, PSE-MR can only obtain an unbiased estimate of the direct causal effect. ③ If the order of the two mediators is misidentified, the estimation of the total, direct and indirect causal
effect is still unbiased, but the estimation of the mediation pathway effects will be affected. ④ When one mediator is missing, the absence of an upstream mediator violates the cross-world ignorability assumption, leading to biased causal effect estimates in the ordered-mediator case.
(3) The MRSL&PSE-MR analysis process was used to construct the conditional causal network of alcohol consumption, 45 serum metabolites and ESCC. Two key downstream intermediate metabolites through which alcohol consumption affects ESCC were found: carnitine and the glycerophospholipid metabolites PC(18:0/0:0) and PC(14:1/22:2), as well as key upstream intermediate metabolites: benzene-ring compounds (3,4-dihydroxyphenylacetic acid and sodium 4-hydroxybenzoate). Other serum metabolites, including 14 glycerophospholipid metabolites, 1 fatty acid metabolite, 1 steroid metabolite, 2 benzene compounds, 1 indole metabolite and carnitine metabolite, play important mediating roles in the development of ESCC from alcohol consumption. Through KEGG metabolic pathway validation, two mediating pathways were found in this study: ① alcohol consumption affects the occurrence of ESCC by influencing, in turn, primary cholate synthesis, phosphonate and phosphinate metabolism, fatty acid synthesis, triglyceride metabolism and linoleic acid metabolism; ② alcohol consumption affects the occurrence of ESCC by influencing, in turn, tyrosine metabolism, triglyceride metabolism and linoleic acid metabolism.

Conclusion
Guided by graphical models and the nested counterfactual model, this study makes full use of publicly available GWAS summary data instead of individual genetic data, exploits the natural advantages of genetic variation to control confounding variables, and relaxes the causal sufficiency and sequential ignorability assumptions on which traditional methods have relied. A causal structure learning algorithm MRSL and a multiple mediation analysis
method PSE-MR are proposed and applied to explore the causal relationships among alcohol consumption, serum metabolites and ESCC in the Chinese population.
(1) By combining MR with graph theory, the causal network structure learning algorithm MRSL is proposed. Compared with existing methods, the accuracy and efficiency of the algorithm are greatly improved. Its soundness, effectiveness and robustness are verified through theoretical proof, statistical simulation and case analysis.
(2) Given a causal graph model, by combining MR with multiple mediation analysis, the PSE-MR method is proposed to accurately estimate the total causal effect, the direct and indirect causal effects, and the causal effects of the mediating pathways. Its soundness, effectiveness and robustness are verified through theoretical proof, statistical simulation and case analysis.
(3) The proposed MRSL&PSE-MR analysis process was applied to study the causal relationships among alcohol consumption, serum metabolites and ESCC in the Chinese population. The key upstream and downstream metabolic mediators in the mediation pathway between alcohol consumption and ESCC were found, as well as the two mediating metabolic pathways verified by KEGG. This provides insight into the metabolic causal mechanism between alcohol consumption and ESCC.
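
Two building blocks named in the abstract, the inverse variance weighted (IVW) MR estimate and the depth-first-search topological sorting of a causal graph, can be illustrated with a minimal Python sketch. This is not the dissertation's MRSL implementation. The function names `ivw_estimate` and `topological_order`, the toy summary statistics and the three-node graph are assumptions made for illustration only, and the IVW formula is the standard fixed-effect form computed from SNP-exposure and SNP-outcome associations.

```python
from typing import Dict, List, Sequence


def ivw_estimate(beta_x: Sequence[float],
                 beta_y: Sequence[float],
                 se_y: Sequence[float]) -> float:
    """Fixed-effect IVW estimate of the causal effect of exposure on outcome.

    Equivalent to regressing SNP-outcome associations (beta_y) on
    SNP-exposure associations (beta_x) through the origin, with each
    SNP weighted by 1 / se_y**2.
    """
    num = sum(bx * by / se ** 2 for bx, by, se in zip(beta_x, beta_y, se_y))
    den = sum(bx ** 2 / se ** 2 for bx, se in zip(beta_x, se_y))
    return num / den


def topological_order(graph: Dict[str, List[str]]) -> List[str]:
    """Topological sorting of a directed acyclic graph by depth-first search.

    `graph` maps each node to the list of its children. The input is
    assumed to be acyclic; cycles are not detected in this sketch.
    """
    visited = set()
    post_order: List[str] = []

    def visit(node: str) -> None:
        if node in visited:
            return
        visited.add(node)
        for child in graph.get(node, []):
            visit(child)
        # A node is recorded only after all of its descendants.
        post_order.append(node)

    for node in graph:
        visit(node)
    # Reversing the post-order places every cause before its effects.
    return post_order[::-1]


# Toy (made-up) summary statistics for three SNPs.
estimate = ivw_estimate(beta_x=[0.1, 0.2, 0.3],
                        beta_y=[0.05, 0.10, 0.15],
                        se_y=[0.01, 0.01, 0.01])
print(round(estimate, 6))

# Hypothetical marginal causal graph: alcohol -> metabolite -> ESCC,
# plus an alcohol -> ESCC edge.
marginal_graph = {
    "alcohol": ["metabolite", "ESCC"],
    "metabolite": ["ESCC"],
    "ESCC": [],
}
print(topological_order(marginal_graph))
```

In such a sketch, the topological order would determine which variables are candidates for the separating set when a multivariable MR step tests whether an edge such as alcohol → ESCC is direct; that step is not shown here.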
Keywords/Search Tags:GWAS summary data, causal structure learning, multiple mediation analysis, alcohol, ESCC, serum metabolites