Background: Lung cancer is one of the most commonly diagnosed cancers and the leading cause of cancer death worldwide. According to the latest report by the International Agency for Research on Cancer (WHO-IARC), lung cancer ranked second in new cases and first in deaths among all malignancies globally in 2020. In China, owing to population aging and the high rate of tobacco consumption, lung cancer has long ranked first in both incidence and mortality. Lung cancer is divided into small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for 85% of all lung cancer cases and is further divided into squamous cell carcinoma, adenocarcinoma, large cell lung cancer and other subtypes. Epidemiological studies have shown that both environmental and genetic factors contribute to the risk of lung cancer, with tobacco exposure being the most important. However, not all smokers develop lung cancer, which suggests a role for genetic factors. The genetic susceptibility to complex diseases, including lung cancer, can be attributed to germline variants, of which single nucleotide polymorphisms (SNPs) are the most common. With the rapid development of genomic technologies, genome-wide association studies (GWAS) have been widely used to explore the genetic susceptibility to complex diseases. Since the first lung cancer GWAS was published in 2008, a total of 51 lung cancer susceptibility loci have been identified. However, although GWAS has been highly successful in revealing the genetic susceptibility to complex diseases, it has limitations. First, GWAS carries a heavy multiple-testing burden, which makes it impossible to detect all the genetic variants associated with a phenotype. Conventional GWAS is therefore underpowered to capture all the heritability explained by SNPs, giving rise to the phenomenon of 'missing heritability'. Second, GWAS only detects the tag SNPs designed on the chips. As a result, the variants
identified by it may not be the causal ones, and most of them are located in non-coding regions. It is also difficult to determine the corresponding causal genes. Therefore, applying effective statistical methods to mine GWAS data in depth and exploring the mechanisms underlying the identified loci are the two main tasks of the post-GWAS era. Accumulating evidence suggests that some GWAS-reported variants are located in transcriptional regulatory regions, and most of them are expression quantitative trait loci (eQTLs), which may regulate the expression of nearby genes. On this basis, researchers have combined genome-wide association studies with genome-wide gene expression data for eQTL mapping, providing a general tool for interpreting GWAS results. Meanwhile, the Genotype-Tissue Expression (GTEx) project carried out a systematic eQTL analysis in multiple normal human tissues and established a comprehensive catalog for studying the associations between genetic variants and gene expression. In the present study, based on existing GWAS data from Asian populations, we systematically evaluated the associations between eQTL SNPs (eSNPs) and lung cancer risk. With the rapid growth of publicly available databases of regulatory elements, transcriptome-wide association studies (TWAS) have been proposed as a principled approach to integrate eQTL data with GWAS summary statistics to explore gene-trait associations. As a gene-based strategy, TWAS can identify the most likely target genes within reported susceptibility loci, as well as novel loci missed by GWAS because of insufficient statistical power. The application of TWAS has brought new insights into the genetic basis of many complex diseases and traits. Recently, Bossé et al. performed a lung cancer TWAS in European populations, in which they revealed a new lung adenocarcinoma susceptibility locus at 9p13.3 and mapped 17 candidate target genes in GWAS-reported loci. However, most of these existing studies
identified gene-trait associations based on a single tissue, ignoring the substantial sharing of local expression regulation across tissues and thereby limiting the effective sample size for developing imputation models. In addition, a hypothesis-free search across genes and tissues increases the multiple-testing burden and thus reduces statistical power. Furthermore, reports have shown that eQTLs with large effects tend to regulate gene expression in multiple tissues. Thus, cross-tissue analysis could improve the imputation efficiency and accuracy of TWAS, and it has been successfully applied to epithelial ovarian cancer, prostate cancer and other diseases. In this study, we conducted a cross-tissue TWAS based on the genotype-expression matrix of 44 tissues in 450 individuals from the GTEx project, together with our large-scale lung cancer GWAS (13,327 cases versus 13,328 controls) in Chinese populations. For the identified genes, colocalization analysis was performed to evaluate whether the GWAS and eQTL signals at the same locus were shared. Finally, functional annotation and phenotype assays were conducted to validate the role of putatively causal genes in carcinogenesis.

Part Ⅰ. Genome-wide analysis of expression quantitative trait loci identified novel susceptibility variants for lung cancer

Methods: We conducted a two-stage case-control study. In the discovery stage, we included results from the NJMU GWAS dataset (2,331 lung cancer cases and 3,077 controls) and the FLCCA GWAS dataset (4,796 lung cancer cases and 3,741 controls). In the replication stage, we included 1,026 lung cancer patients and 1,006 controls frequency-matched to the cases by age, gender and geographic region. eSNPs were identified from 278 normal lung tissue samples in the GTEx database (V6p release). A total of 757,615 eSNPs were significant after FDR correction (cis-eQTLs, FDR-corrected P-value < 0.05), and 477,190 of them were available in the NJMU GWAS dataset and FLCCA
GWAS dataset. SNPs were excluded if they (1) were located in previously reported loci; (2) were located in the MHC region; (3) had a call rate < 95%; (4) had a minor allele frequency (MAF) < 0.01; (5) departed from Hardy-Weinberg equilibrium (HWE) in the controls (P-value < 1 × 10⁻⁶); or (6) had INFO < 0.8. After filtering, 418,112 eSNPs remained. From these variants, we selected eSNPs for replication according to the following criteria: (1) P < 0.01 in both the NJMU and FLCCA studies; (2) P-meta < 1.0 × 10⁻⁴; (3) consistent effects in both GWAS datasets (heterogeneity test I² < 0.75); and (4) when multiple SNPs were in strong linkage disequilibrium (LD, r² ≥ 0.5), only the SNP with the lowest P-value was selected. Finally, 11 SNPs were selected for replication. In the replication stage, SNPs were genotyped using the iPLEX Sequenom MassARRAY platform. Associations between SNPs and lung cancer risk were assessed under an additive model by calculating odds ratios (ORs) and 95% confidence intervals (95% CIs) using logistic regression adjusted for age, sex, principal components and pack-years of smoking or smoking status (where applicable) in each GWAS and in the replication dataset. Results from each study were combined in a fixed-effect meta-analysis. Subgroup analyses by age, gender, smoking status and histology were then performed for the identified eSNPs. We further performed a gene-environment interaction analysis with a regression model that included the eSNP, smoking status and their interaction term simultaneously, adjusted for age and gender. Finally, we performed functional annotation and enrichment analysis for the identified eSNPs.

Results: Under an additive model, the eSNP rs505974 was significantly associated with lung cancer risk in the same direction as in the GWAS scan (OR = 0.88, P = 0.036); in addition, rs79589812 was marginally associated with lung cancer risk in the replication stage (OR = 1.43, P = 0.053). After combining the results from the two stages, these two
SNPs were significantly associated with lung cancer risk (OR = 0.90, P = 6.51 × 10⁻⁶ for rs505974; OR = 1.38, P = 2.45 × 10⁻⁶ for rs79589812). In the subgroup analyses, the effect of rs505974 was much stronger in ever-smokers than in non-smokers (P-heterogeneity = 0.013), whereas no significant heterogeneity was observed for either SNP in the other subgroups. The gene-environment interaction analysis showed a negative interaction between rs505974 and smoking (P-interaction = 0.041). Based on GTEx results, the C allele of rs505974 was associated with decreased expression of CLDN16 (β = -0.24, P = 3.30 × 10⁻⁵). Furthermore, differential expression analysis showed that CLDN16 was markedly overexpressed in lung squamous cell carcinoma (LUSC, P = 5.47 × 10⁻⁵) but not in lung adenocarcinoma (LUAD, P = 0.090). rs79589812 was significantly associated with the expression of SPATC1L (β = 0.56, P = 4.50 × 10⁻⁹), which was significantly overexpressed in both LUSC (P = 2.43 × 10⁻⁵) and LUAD (P = 2.25 × 10⁻⁵). After Bonferroni correction, the genes co-expressed with CLDN16 were enriched in the chemical carcinogenesis pathway and in metabolism pathways related to cytochrome P450 (CYP), while the genes co-expressed with SPATC1L were significantly enriched in the base excision repair pathway.

Conclusions: After systematically evaluating the associations between eSNPs and lung cancer risk, we identified two potential eSNPs, rs505974 at 3q28 and rs79589812 at 21q22.3, that were significantly associated with susceptibility to lung cancer. Furthermore, functional annotation integrating multiple public datasets suggested that several genes regulated by these two SNPs might play important roles in the development of lung cancer.

Part Ⅱ. Cross-tissue transcriptome-wide association study identified novel susceptibility genes for non-small cell lung cancer

Methods: A cross-tissue TWAS was performed using UTMOST with genotype and normalized gene expression data from 450 individuals in the GTEx project and our
large-scale lung cancer GWAS. All participants were from our published GWASs, including the NJMU Global Screening Array (GSA) project (10,248 cases and 9,298 controls), the NJMU OncoArray project (953 cases and 953 controls) and the NJMU GWAS project (2,126 cases and 3,077 controls). In addition to the overall analysis of NSCLC, analyses stratified by histological type were performed for LUAD and LUSC when available. We used the false discovery rate (FDR) to correct for multiple comparisons. To prioritize gene-level associations at each hit locus, we performed a cross-tissue conditional analysis for genes within 2 Mb using UTMOST, with the significance level set to P ≤ 1.00 × 10⁻⁴. To reduce false positives and obtain tissue-specific associations, a traditional TWAS was performed with FUSION in each tissue for the genes identified by UTMOST. For each candidate gene, we thus obtained the cross-tissue effect size from UTMOST and the association effect size in each of the 44 tissues from FUSION. We selected genes for further analysis according to the following criteria: (1) UTMOST cross-tissue FDR P ≤ 0.05; and (2) validation by FUSION with FDR P ≤ 0.05 in at least one tissue. We then performed a colocalization analysis using the R package 'coloc', which reports posterior probabilities (PP) for five hypotheses: H0 (no causal variant), H1 (causal variant for GWAS only), H2 (causal variant for eQTL only), H3 (two distinct causal variants) and H4 (one common causal variant). A high PP4 (> 0.75) indicates a high probability of a shared causal signal, whereas a high PP3 (> 0.75) indicates a high probability of distinct eQTL and GWAS associations. We then classified the identified susceptibility genes into levels according to the available evidence: (1) genes significant in lung tissue and with PP4 > 0.75 were assigned to level A; (2) genes significant in lung tissue but with PP4 ≤ 0.75 were classified as level
B; (3) genes significant and with PP4 > 0.75 in non-lung tissue were defined as level C; and (4) genes significant in non-lung tissue but with PP4 ≤ 0.75 were categorized as level D. In addition to the data from the GTEx project, we also performed eQTL analysis for the genes significant in lung tissue using data from our published Nanjing Lung Cancer Cohort (NJLCC) project. Finally, we performed functional annotation and phenotype assays for these novel susceptibility genes.

Results: In the cross-tissue TWAS, we identified 35 genes that reached statistical significance after false discovery rate (FDR) correction (P_UTMOST ≤ 0.05). Conditional analysis of each susceptibility locus excluded 8 of them with P_conditional > 1.00 × 10⁻⁴. After further evaluating the associations in each tissue, we identified 6 susceptibility genes in known loci and 12 novel ones. Among the latter, five novel genes, ATR (P_cross-tissue = 1.45 × 10⁻⁵, P_Lung = 9.68 × 10⁻⁵), DCAF16 (P_cross-tissue = 2.57 × 10⁻⁵, P_Lung = 2.89 × 10⁻⁵), GYPE (P_cross-tissue = 1.45 × 10⁻⁵, P_Lung = 2.17 × 10⁻³), PARD3 (P_cross-tissue = 5.79 × 10⁻⁶, P_Lung = 4.05 × 10⁻³) and CBL (P_cross-tissue = 5.08 × 10⁻⁷, P_Lung = 1.82 × 10⁻⁴), were significantly associated with the risk of lung cancer in both the cross-tissue and lung tissue models. Further colocalization analysis indicated that rs7667864 (C>A) and rs2298650 (G>T) drove the GWAS association signals at 4p15.31-32 (OR = 1.09, 95% CI: 1.04-1.12, P_GWAS = 5.54 × 10⁻⁵) and 11q23.3 (OR = 1.08, 95% CI: 1.04-1.13, P_GWAS = 5.55 × 10⁻⁵), as well as the expression of DCAF16 (β_GTEx = 0.24, P_GTEx = 9.81 × 10⁻¹⁵; β_NJLCC = 0.29, P_NJLCC = 3.84 × 10⁻⁸) and CBL (β_GTEx = -0.17, P_GTEx = 2.82 × 10⁻⁸; β_NJLCC = -0.32, P_NJLCC = 2.61 × 10⁻⁷). Functional annotation supported a carcinogenic role for these novel susceptibility genes in lung cancer. Furthermore, knockdown of DCAF16 significantly suppressed cell viability, decreased colony-forming ability and suppressed cell migration in the A549 and SPCA1 cell
lines. Similarly, 5-ethynyl-2'-deoxyuridine (EdU) incorporation assays showed that knockdown of DCAF16 significantly suppressed cell proliferation in the A549 and SPCA1 cell lines. These results indicated that DCAF16 may play an important role in lung carcinogenesis.

Conclusions: In the present study, we performed a cross-tissue TWAS in lung cancer and identified 27 susceptibility genes, 18 of which were validated by subsequent single-tissue analysis, including eight susceptibility genes in lung tissue and ten in non-lung tissues. Colocalization analysis and functional annotation consistently suggested that DCAF16 at 4p15.31-32 and CBL at 11q23.3 are probably novel susceptibility genes for lung cancer, and these genes were also validated by in vitro functional assays. Taken together, we provide a candidate list of susceptibility genes for lung cancer within and outside reported susceptibility loci. Furthermore, we highlight the carcinogenic effect of two novel susceptibility genes at 4p15.31-32 and 11q23.3.
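
The four-level evidence scheme described in the Part Ⅱ Methods amounts to a small decision rule. The Python sketch below is illustrative only: the function name `evidence_level`, its arguments and the example values are assumptions introduced here and are not part of the original analysis pipeline. Only the PP4 threshold of 0.75 and the split between lung and non-lung tissue come from the text. It assumes the gene has already passed the UTMOST and FUSION selection criteria, and that `pp4` is the colocalization posterior probability in the tissue where the gene was significant.

```python
PP4_THRESHOLD = 0.75  # shared-causal-signal cutoff stated in the Methods


def evidence_level(significant_in_lung: bool, pp4: float) -> str:
    """Assign evidence level A-D to a gene that passed TWAS selection.

    significant_in_lung: True if the single-tissue TWAS was significant
        in lung tissue; False if it was significant only in non-lung tissue.
    pp4: coloc posterior probability of one common causal variant (H4)
        in the tissue where the gene was significant (hypothetical input).
    """
    if not 0.0 <= pp4 <= 1.0:
        raise ValueError("pp4 must be a probability between 0 and 1")
    colocalized = pp4 > PP4_THRESHOLD
    if significant_in_lung:
        return "A" if colocalized else "B"
    return "C" if colocalized else "D"


if __name__ == "__main__":
    # Hypothetical genes and values, for illustration only.
    examples = {
        "GENE1": (True, 0.91),
        "GENE2": (True, 0.40),
        "GENE3": (False, 0.88),
        "GENE4": (False, 0.10),
    }
    for gene, (in_lung, pp4) in examples.items():
        print(gene, evidence_level(in_lung, pp4))
```

Because the rule uses a strict inequality, a gene with PP4 exactly equal to 0.75 falls into level B or D, matching the "PP4 ≤ 0.75" wording in the text.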
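
Part Ⅰ combines the per-study results with a fixed-effect meta-analysis. The text does not state which software was used, so the following is only a minimal sketch of the standard inverse-variance fixed-effect estimator on the log-OR scale. It assumes each study supplies an odds ratio and the standard error of its log odds ratio; the function name `fixed_effect_meta` and the input values are hypothetical and are not taken from the study data.

```python
import math


def fixed_effect_meta(odds_ratios, standard_errors):
    """Inverse-variance fixed-effect meta-analysis on the log-OR scale.

    odds_ratios: per-study odds ratios.
    standard_errors: per-study standard errors of the log odds ratios.
    Returns (pooled OR, (95% CI lower, 95% CI upper), two-sided P-value).
    """
    if len(odds_ratios) != len(standard_errors) or not odds_ratios:
        raise ValueError("need one standard error per odds ratio")
    betas = [math.log(o) for o in odds_ratios]
    weights = [1.0 / se ** 2 for se in standard_errors]  # inverse variance
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided P-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (
        math.exp(pooled_beta - 1.96 * pooled_se),
        math.exp(pooled_beta + 1.96 * pooled_se),
    )
    return math.exp(pooled_beta), ci, p_value


if __name__ == "__main__":
    # Hypothetical inputs for three studies, for illustration only.
    pooled_or, ci, p = fixed_effect_meta([0.90, 0.91, 0.88], [0.04, 0.03, 0.06])
    print(f"pooled OR = {pooled_or:.3f}, "
          f"95% CI = {ci[0]:.3f}-{ci[1]:.3f}, P = {p:.2e}")
```

A fixed-effect model assumes that all studies estimate the same underlying effect, which is why the selection criteria in the text also require low between-study heterogeneity.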