
A Causal GWAS Method For Fine-mapping Causal SNPs

Posted on: 2022-03-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X R Sun
Full Text: PDF
GTID: 1480306608480064
Subject: Fundamental Medicine
Abstract/Summary:
Genome-wide association studies (GWAS) have reported more than 140,000 SNPs associated with more than 4,000 diseases/traits since the approach was proposed in 2005. However, only a few of the reported SNPs have been successfully validated. A review of the analysis strategies, methods and practical applications of GWAS shows that researchers first used univariate regression for marginal correlations, then genomic principal components regression to control for population stratification, next multivariate regression models adjusted for covariates, and finally genome-wide fine-mapping of susceptibility loci to explore the associations between SNPs and traits. Multivariate analysis with adjustment for demographic, lifestyle, and clinical covariates is now widely applied in GWAS. However, this strategy has been questioned and widely discussed by statisticians. Both theoretical justifications and statistical simulations show that adjusting for these covariates cannot control confounding bias, nor does it always improve statistical power. On the contrary, mistaken adjustment may induce bias. Unfortunately, these arguments and doubts have not attracted the attention of researchers, and the multivariate analysis strategy seems to be the unshakable "standard analysis strategy" in today's GWAS.

GWAS is fundamentally designed to elucidate the causal associations between SNPs and diseases/traits, not mere associations. Therefore, researchers should establish the analytical strategy of GWAS within the theoretical framework of causal inference, guided by its methodology. In fact, the causal direction from SNP to disease/trait is quite clear. Theoretically, no demographic or clinical covariate can alter an individual's inborn genotype. Therefore, demographic/clinical covariates do not change any SNP and so cannot confound the causal associations between SNPs and diseases/traits. This is
the reason for using univariate marginal regressions in the initial GWAS analyses. However, population stratification is a confounding factor that biases the causal estimates from SNPs to diseases/traits. Therefore, genomic control (GC), adjustment for genetic principal components (PCA) and other strategies are appropriately used to eliminate population stratification. Using this stratification-control strategy, GWAS has successfully identified many novel susceptibility genes, which have guided the discovery of new biological mechanisms of susceptibility and have been applied in clinical practice. This undoubtedly increases confidence in post-GWAS research. GWAS need not be limited to the discovery of susceptibility genes; it has wider applications and value. In this study, we propose a causal GWAS (CGWAS) strategy and method for fine-mapping potential causal variants, aiming to provide new strategies and methods for fine-mapping genome-wide causal SNPs.

Methods: First, we systematically reviewed the development of GWAS strategies and methods based on 3,198 GWAS research papers published on the GWAS Catalog website from March 2005 to July 2018, to determine which kinds of covariates were adjusted for and how many SNPs were validated. Combining the theory of counterfactual causal inference with causal diagrams, we used theoretical derivation and statistical simulations to illustrate the consequences of adjusting for demographic, lifestyle and clinical covariates in GWAS. We considered estimation bias and precision (absolute deviation, variance, etc.), statistical performance (Type I
error rate, statistical power), and the ability to discriminate susceptibility loci (false discovery rate, FDR; true discovery rate, TDR; and Matthews correlation coefficient, MCC) to evaluate the influence of covariate adjustment. Furthermore, we analyzed the reasons for these consequences under different scenarios. Based on the mechanism of confounding in GWAS, we proposed the causal GWAS (CGWAS) strategy of adjusting for SNPs that are associated with both the target SNP (through LD) and the disease/trait phenotype. We used theoretical derivation and statistical simulation to illustrate the feasibility of this strategy.

Based on causal diagram theory, we propose the following algorithm for fine-mapping causal variants. First, we use marginal correlation/regression models to screen for associated SNPs and order them by P value within each independent region of the genome. We set the SNP with the minimum P value as the initial target SNP, $\mathrm{SNP}_T$, and enter the remaining SNPs in the region one at a time into the bivariate regression model $Y=\beta_T\,\mathrm{SNP}_T+\beta_i\,\mathrm{SNP}_i$, in order of P value from large to small. SNPs that reach statistical significance are retained as target SNPs. We then construct the trivariate regression model $Y=\beta_T\,\mathrm{SNP}_T+\beta_{k1}\,\mathrm{SNP}_{k1}+\beta_{k2}\,\mathrm{SNP}_{k2}$ and continue the process until the stopping criteria are met. The SNPs that finally remain are the fine-mapped causal SNPs. We used theoretical derivation and statistical simulation to demonstrate the feasibility of this process.

To validate the strategy, we fine-mapped BMI, breast cancer and pan-cancer (i.e., all cancers treated as one comprehensive cancer phenotype) using the open UK Biobank (UKBB) database. We applied the Causal Diagram-based Stepwise Fine-Mapping (CDSFM) algorithm alongside the four most widely used fine-mapping methods (generalized linear regression model (GLM), Lasso regression model (LASSO), GCTA, and Bayesian variable selection regression model (BVSR)) to fine-map the susceptibility loci of these traits. Using functional annotation analysis
and literature search, we compared the fine-mapping results to illustrate the advantages of the CDSFM algorithm. Finally, as a practical application, we applied the CDSFM algorithm to fine-map the susceptibility loci of esophageal squamous cell carcinoma, based on the Shandong Province cohort for early diagnosis and early treatment of esophageal squamous cell carcinoma and its whole-genome genotype dataset. We then compared the results with those for esophageal adenocarcinoma in the UK to identify genetic differences.

Results: Adjusting for demographic, lifestyle, clinical and other covariates has become a trend and the "standard method" in GWAS. Approximately 80% of GWAS adjust for covariates using a linear or logistic regression model. Covariates such as age, gender, smoking, alcohol consumption, BMI, and blood glucose level are the most commonly used. However, less than 3% of the SNPs identified in these studies have been functionally validated. Although the roles of the target SNP (G), covariate (C) and disease/trait (Y), as well as their relationships, are extremely complex, given the initial aim of GWAS we simplified and summarized all scenarios into 15 causal diagrams according to the various roles of C. Both theoretical justification and statistical simulation show that arbitrary adjustment for the covariate C cannot improve statistical power but can induce estimation bias in many cases. More seriously, it can increase FDR, decrease TDR, and degrade overall discriminant ability (MCC).

The strategy of adjusting for SNPs that are associated with both the target SNP and Y not only yields unbiased estimates but also improves statistical power, effectively reduces FDR and raises TDR, and thus significantly improves the hit rate for causal SNPs. Building on this strategy, the proposed fine-mapping strategy (the CDSFM algorithm) can effectively reduce FDR and improve TDR by the stepwise conditional independence test
without the constraint of a specific causal diagram model. The hit rate for capturing causal SNPs can be increased to as much as 90% with high statistical power. The overall performance is significantly better than that of GLM, Lasso, GCTA, BVSR and other methods.

As applications, the CDSFM algorithm was used to fine-map the genome-wide susceptibility loci of the quantitative trait BMI and the qualitative traits breast cancer and pan-cancer. (1) From 48,982 associated SNPs initially screened by a typical GWAS of BMI, CDSFM fine-mapped 703 SNPs on 584 genes across 22 chromosomes, many of which have been validated. (2) From 2,074 associated SNPs identified by a typical breast cancer GWAS, CDSFM fine-mapped 40 SNPs on 24 genes and 13 intergenic regions across 16 chromosomes; twenty of these SNPs have been validated. (3) From 1,643 associated SNPs identified by a typical pan-cancer GWAS, CDSFM fine-mapped 38 SNPs on 26 genes and 5 intergenic regions across 16 chromosomes; seventeen of these genes have been validated.

The CDSFM algorithm was then applied to fine-map the susceptibility loci of esophageal squamous cell carcinoma in a high-incidence area of esophageal carcinoma in Shandong, China. A total of 37 SNPs were fine-mapped to 15 genes on 8 chromosomes, of which 6 genes were highly correlated with esophageal squamous cell carcinoma. Using the same method, we fine-mapped the susceptibility loci of esophageal adenocarcinoma in the white British population: 6 SNPs were mapped to 4 genes and 1 intergenic region on 4 chromosomes. The susceptibility loci of esophageal squamous cell carcinoma in the Chinese population and of esophageal adenocarcinoma in the white British population did not overlap, suggesting that the two cancers have distinct genetic mechanisms.

Conclusions: (1) It is unreasonable to adjust for demographic, lifestyle, and clinical covariates in GWAS. SNPs associated with both the target SNP and
the disease/trait should be adjusted for instead of these covariates. (2) Based on the strategy of adjusting for SNPs, we proposed a new fine-mapping strategy and method (the CDSFM algorithm) that uses the stepwise conditional independence test and can effectively reduce FDR without being restricted to a specific causal graph model. TDR is also improved, and the rate of capturing causal SNPs is increased to as much as 90%, with high statistical power. The overall performance is significantly better than that of existing fine-mapping methods such as GLM, Lasso, GCTA and BVSR. CDSFM can serve as a new strategy and method for fine-mapping genome-wide genetic susceptibility loci. (3) The CDSFM algorithm was applied to fine-map the genetic loci of the quantitative trait BMI and the qualitative traits breast cancer and pan-cancer, identifying a series of both previously reported and unreported genetic susceptibility loci. Esophageal squamous cell carcinoma in the Chinese population and esophageal adenocarcinoma in the white British population show no overlap in genetic susceptibility loci, suggesting that these two cancers have distinct genetic mechanisms.
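
The stepwise conditional testing idea described in the Methods can be illustrated with a minimal Python sketch. This is not the dissertation's CDSFM implementation: it is a plain forward stepwise conditional regression for a single region, written under several assumptions. The function names (`conditional_p_values`, `stepwise_fine_map`) are hypothetical, P values use a normal approximation to the t statistic, the threshold of 5e-8 is the conventional genome-wide level rather than a value given in the abstract, and candidates are added most-significant-first, which simplifies the entry order described above.

```python
import math
import numpy as np


def conditional_p_values(y, X):
    """Fit OLS of y on X plus an intercept; return one two-sided P value per column of X.

    Assumption: a normal approximation to the t statistic (adequate for large samples).
    Perfectly collinear SNPs must be pruned beforehand, otherwise the inverse fails.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    z = beta / np.sqrt(np.diag(cov))
    # Skip the intercept (index 0) and convert |z| to a two-sided P value.
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z[1:]])


def stepwise_fine_map(y, G, alpha=5e-8):
    """Forward stepwise conditional selection within one region (illustrative only).

    G is an (n_samples, n_snps) genotype matrix; returns sorted column indices
    of the SNPs that stay significant after conditioning on the selected set.
    """
    n_snps = G.shape[1]
    # Step 1: marginal screening, one univariate regression per SNP.
    marginal = np.array(
        [conditional_p_values(y, G[:, [j]])[0] for j in range(n_snps)]
    )
    candidates = [j for j in range(n_snps) if marginal[j] < alpha]
    if not candidates:
        return []
    # Step 2: the SNP with the smallest marginal P value is the initial target.
    selected = [min(candidates, key=lambda j: marginal[j])]
    candidates.remove(selected[0])
    # Step 3: grow the model (bivariate, trivariate, ...) until nothing survives.
    while candidates:
        cond = {
            j: conditional_p_values(y, G[:, selected + [j]])[-1]
            for j in candidates
        }
        # Drop candidates that are no longer significant given the targets.
        candidates = [j for j in candidates if cond[j] < alpha]
        if not candidates:
            break
        best = min(candidates, key=lambda j: cond[j])
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```

Calling `stepwise_fine_map(y, G)` with a phenotype vector and a genotype matrix for one region returns the indices of the retained SNPs. In this sketch a SNP that is merely in LD with a selected SNP tends to lose significance once that SNP is in the model, which is the behaviour the conditional independence test relies on. Discarding a candidate permanently once it fails a conditional test is a simplifying design choice of the sketch.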
Keywords/Search Tags:Genome-wide association analysis, Causal inference, Susceptibility loci, Fine-mapping