
Omnigenic Mendelian Randomization Method And Application Research

Posted on: 2022-05-06; Degree: Doctor; Type: Dissertation
Country: China; Candidate: L Wang
GTID: 1484306311976509; Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background

Causal inference is a perennial theme of epidemiological research. However, because of confounding and reverse causality, the associations between exposure and outcome obtained in observational studies are often unreliable. To determine causal direction and control for confounding, Mendelian randomization (MR) uses genetic variation as instrumental variables to estimate the causal effect of an exposure on an outcome from observational data. With the development of high-throughput omics technology, genome-wide association studies (GWAS) based on large cohorts provide rich data for MR. As the application field of MR expands, research on its theory and methods is also deepening. MR methods differ in their instrumental variable selection strategies, their modeling assumptions about pleiotropic effects, and their methods of parameter estimation and statistical inference. However, existing MR methods all select a small number of SNPs with strong effects as instrumental variables in order to avoid the complex linkage disequilibrium (LD) relationships between SNPs. Such a selection strategy inevitably reduces statistical power and introduces estimation bias. To overcome these drawbacks, a natural idea is to include all genome-wide SNPs in the instrumental variable set G. This is the newly developed omnigenic idea of instrumental variables (that is, assuming that all genetic variants in the genome contribute to phenotypic traits). However, once such a large number of genome-wide SNPs is incorporated into the two-sample Mendelian randomization model, how to achieve efficient, accurate, and unbiased estimation of the model parameters is a key scientific problem that must be solved, and it is the core of this research. To this end, in the second part of this dissertation, we proposed the omnigenic assumption that the instrumental variable set G consists of the whole-genome SNPs, which
represents the full effect of genetic factors G on exposure X, and that G may also have a direct effect on outcome Y (that is, horizontal pleiotropy may exist). At the same time, in two-sample Mendelian randomization the exposure data and the outcome data may contain overlapping samples. In this dissertation we proposed the omnigenic Mendelian randomization (OMR) model, a two-sample Mendelian randomization model based on the omnigenic assumption for complex traits in genetics. In the current era of omics, the OMR model has broad application prospects in cross-omics analyses. From a theoretical perspective, because MR has the distinctive advantage of determining the direction of causality, combining the results of cross-omics MR analyses could construct a causal network of DNA → RNA → protein → metabolite → disease phenotype and thereby open the "black box" of exposure → disease outcome, which supports the construction of systems-epidemiology causal networks, drug target design, and the formulation and evaluation of prevention and diagnosis methods. As an application case, in the third part of this dissertation, relying on the National Early Diagnosis and Treatment of Esophageal Cancer Project, a screening cohort for early diagnosis and treatment of esophageal cancer was established in an area of Shandong Province with a high incidence of esophageal cancer. The OMR model was used to analyze the causal effects of serum metabolites on esophageal squamous cell carcinoma (ESCC). By integrating the genome and the metabolome, the causal association serum metabolites → ESCC was determined.

Omnigenic Mendelian Randomization (OMR) model

Methods

Under the setting of two independent samples, we first adopt a composite likelihood strategy, under the assumption that the genetic effects follow a normal distribution, to estimate the effect of instrumental variable G on exposure X and of instrumental variable G on
outcome Y jointly. In addition, to account for the LD that exists widely across the genome, we construct the composite likelihood function using the LD information as weights on the marginal likelihood functions, so that the composite likelihood is weighted. For the settings of two overlapping samples and of one sample, the sample covariance term is plugged into the above model, so that the influence of the correlation induced by overlapping samples on parameter estimation is effectively avoided. Furthermore, an EM-NR algorithm that combines the Expectation-Maximization (EM) algorithm and the Newton-Raphson (NR) algorithm is developed to estimate the parameters efficiently. To speed up the convergence of NR, the EM estimate is used as the initial value of the NR algorithm. Jackknife resampling is then used to test hypotheses about the model parameters.

To comprehensively evaluate the effectiveness, accuracy, and robustness of the omnigenic Mendelian randomization (OMR) model, this study designed a systematic simulation experiment. Based on the realistic distribution of SNP genotypes in the population and their LD patterns, objective and realistic simulated data sets are generated, with different settings of heritability, pleiotropic effects, and the causal effect of exposure X on outcome Y, and with different genetic effect patterns (including all SNPs in the genome having genetic effects, 1% of the SNPs having genetic effects, 10% of the SNPs having genetic effects, etc.). Under these combinations of conditions, the simulations are used to evaluate the parameter estimation bias (coverage rate), the type I error control, and the power of the OMR model. In addition, to compare the advantages and disadvantages of the OMR model with recent MR methods, we
compared OMR with IVW, Egger regression, MRMix, BWMR, and CAUSE. Finally, to evaluate the performance of the model in causal association analysis of real data, two types of real data sets are used. In the first type, the causal effect of exposure X on outcome Y is known to be 1, including the categorical trait analysis "coronary artery disease (CAD) → CAD" and the continuous trait analysis "height → height". In the second type, the causal effect of exposure X on outcome Y is not set to 1 and its true value is unknown, including 20 quantitative traits → CAD and 20 quantitative traits → asthma.

Results

(1) Theoretical results: We developed the summary-statistics-based omnigenic Mendelian randomization (OMR) model for the two-independent-sample, overlapping two-sample, and one-sample settings, respectively, and further developed the EM-NR hybrid algorithm, in which initial estimates of the parameters are obtained by the EM algorithm and are then used as the starting values of the NR algorithm for rapid iteration, so that the model parameters are estimated both efficiently and accurately.

(2) Simulation results:

1) Two-independent-sample setting:
① Type I error control: under the null, in the absence of horizontal pleiotropy, the OMR model produces reasonable or slightly inflated type I error rates; in the presence of horizontal pleiotropy, OMR is the only method that produces reasonably controlled type I error across genetic architectures and SNP heritability settings.
② Power: overall, the performance of OMR is robust, and OMR has the highest power in almost all simulation scenarios. The only setting where OMR does not perform best is the extremely sparse setting where only 10 SNPs (the proportion of effect SNPs ≈
1/30000) affect the exposure.
③ Estimation accuracy: OMR produces accurate estimates of the causal effect and reasonable, accurate coverage rates.

2) Overlapping two-sample and one-sample settings:
① Type I error control: in the presence of horizontal pleiotropy, OMR provides reasonable type I error control in both the overlapping two-sample and the one-sample settings.
② Power: OMR has the highest power in both settings.
③ Estimation accuracy: OMR produces accurate estimates of the causal effect in both settings.

(3) Real data application results:

1) CAD → CAD and height → height analyses: among the six MR methods compared, OMR is the only method that detects a significant causal association in both cases and produces a 95% confidence interval containing the true value of 1.
2) 20 quantitative traits → CAD and 20 quantitative traits → asthma analyses: in both cases, OMR detects the largest number of significant traits, and the significant associations can usually be verified by at least one other MR method. In addition, most of the significant associations are further supported by clinical trials or literature evidence.

Taking the Causal Association Analysis of Serum Metabolites on Esophageal Squamous Cell Carcinoma as an Example

Methods

Based on the National Project for Early Diagnosis and Treatment of Esophageal Cancer, a total of 880 participants constituted the mGWAS study cohort (546 in data set 1 and 334 in data set 2), and a total of 1,046 participants constituted the ESCC case-control study cohort (969 controls and 77 ESCC cases).

(1) In the mGWAS study of a population with high ESCC incidence, the Infinium Omni2.5Exome-8v1-3 (Illumina) chip was used to genotype participants' whole blood samples. Serum samples were analyzed by UHPLC-QTOF/MS for untargeted metabolomics. After adjusting and standardizing all metabolite traits using covariates (age, gender, endoscopy
results, and the first 10 principal components), a two-stage analysis strategy was adopted. In data set 1 and data set 2 separately, a linear regression model was used to analyze the association between approximately 4.2 million genome-wide SNPs and 185 metabolite traits. A meta-analysis was then performed to combine the results of the two analyses.

(2) In the ESCC case-control study, the ESCC outcome was first adjusted and standardized using covariates (age, gender, and the top 10 principal components), and then a linear regression model was used to analyze the association between approximately 4.2 million genome-wide SNPs and ESCC risk.

(3) The summary statistics of the 185 serum metabolites from the mGWAS were treated as exposures, the summary statistics of ESCC from the GWAS were treated as the outcome, 4,085,890 SNPs were used as instrumental variables, and the OMR model was applied to examine the causal effects of serum metabolites on ESCC.

Results

(1) mGWAS results in the population with high ESCC incidence: a total of 4,327 SNP-metabolite associations reached the genome-wide and metabolome-wide significance level (Z test, P < 5×10⁻⁸/185 = 2.70×10⁻¹⁰), including 19 independent SNP-metabolite associations involving 10 independent SNPs and 17 different serum metabolites. Of the 19 associations, 7 replicated previously reported significant associations, and the other 12 had not been reported in previous studies.

(2) GWAS results for ESCC: 1 SNP reached the genome-wide significance level (5×10⁻⁸) and 42 SNP loci reached the suggestive significance level (1×10⁻⁵); after clumping, a total of 10 SNPs were determined to be top SNPs. Among them, 6 have been reported to be associated with ESCC and the other 4 have not. For these 4, the literature discusses associations with other tumors or with risk factors for ESCC (such as smoking and drinking), indicating that their relevance to ESCC needs to be
further explored.

(3) OMR analysis results: A total of 11 serum metabolite traits were identified as having a potential causal association with ESCC (P < 0.05), of which 9 remained significant after Bonferroni correction. Myristic acid, indole-3-pyruvate, hypoxanthine, CDCA, and PC 18:1 are risk factors for ESCC. L-histidine, creatinine, PG 24:1, PC 41:6, PC 38:4, and PG 23:2 are protective factors for ESCC. Among them, myristic acid, CDCA, and PC 38:4 are genetically influenced metabolites identified in the mGWAS analysis, suggesting the candidate pathway genome → serum metabolite → ESCC.

Conclusions

(1) We developed an omnigenic Mendelian randomization (OMR) method for the two-independent-sample, overlapping two-sample, and one-sample settings. The method relies on a computationally efficient composite likelihood framework and an EM-NR hybrid algorithm for scalable inference. Simulations show that OMR produces accurate causal effect estimates and reasonably calibrated type I error control while being more powerful than existing approaches across a range of simulation scenarios. OMR is implemented in the R package OMR.

(2) Real data applications show that, when the true causal effect equals one, OMR detected a non-zero causal effect and produced confidence intervals covering the true value of one; when the true causal effect is unknown, OMR detected the largest number of statistically significant quantitative traits. Many of the causal associations identified by OMR have strong biological support in the literature.

(3) To verify the practicability of the OMR method, GWAS summary statistics of serum metabolites were first obtained through the mGWAS analysis in a population with high ESCC incidence, and GWAS summary statistics of ESCC were then obtained through the GWAS analysis of the ESCC case-control study. Finally, the OMR model was applied to explore the causal association
between serum metabolites and ESCC: 11 serum metabolite traits were statistically significant (P < 0.05), of which 9 remained significant after Bonferroni correction.
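
The significance thresholds quoted in the abstract can be reproduced with a short calculation. The sketch below is illustrative only: it computes the genome-wide and metabolome-wide threshold 5×10⁻⁸/185 stated in the abstract and, as an assumption not confirmed by the abstract, treats the Bonferroni correction of the OMR results as a correction over the same 185 metabolite traits. The function names and the example p-values are hypothetical and are not results from the dissertation.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold alpha / n_tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def significant(p_values: dict, threshold: float) -> dict:
    """Keep only the entries whose p-value is strictly below the threshold."""
    return {name: p for name, p in p_values.items() if p < threshold}


N_METABOLITES = 185        # number of metabolite traits stated in the abstract
GENOME_WIDE_ALPHA = 5e-8   # conventional genome-wide level used in the abstract

# Genome-wide and metabolome-wide threshold for the mGWAS (stated in the abstract).
mgwas_threshold = bonferroni_threshold(GENOME_WIDE_ALPHA, N_METABOLITES)

# Assumption: the OMR Bonferroni correction is taken over the 185 exposures.
omr_threshold = bonferroni_threshold(0.05, N_METABOLITES)

print(f"mGWAS threshold: {mgwas_threshold:.2e}")  # 2.70e-10
print(f"OMR threshold (assumed): {omr_threshold:.2e}")

# Hypothetical p-values, for illustration only.
example_p = {"metabolite_A": 1e-5, "metabolite_B": 0.01}
print(significant(example_p, omr_threshold))
```

Under this assumption, a trait with a nominal P < 0.05 survives correction only if its p-value is below roughly 2.7×10⁻⁴, which is consistent with 11 nominally significant traits shrinking to 9 after correction.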
Keywords/Search Tags: Mendelian Randomization, Genomics, Metabolomics, Esophageal Squamous Cell Carcinoma