| Background: Lung cancer is the most commonly diagnosed malignancy in the world. According to the most recent report from the International Agency for Research on Cancer (IARC), lung cancer was the leading cause of cancer morbidity and mortality worldwide in 2018. Because of the high rate of tobacco use and heavy environmental pollution, the incidence of lung cancer ranks first among males and second among females in China. The heavy burden of lung cancer has become one of the major public health issues in China. Non-small cell lung cancer (NSCLC), including adenocarcinoma (AD), squamous cell carcinoma (SqCC), and large-cell lung cancer, is the major histological type of lung cancer and accounts for ~85% of all lung cancer cases. Epidemiological studies show that both genetic and environmental factors contribute to the risk of lung carcinogenesis. Although tobacco smoking is generally considered the major risk factor for lung cancer, fewer than 20% of smokers eventually develop lung cancer, suggesting that genetic predisposition to lung cancer varies among individuals. Germline variants are generally considered the major contributor to this varying susceptibility, and the single nucleotide variant (SNV) is the most common type. In the past decade, rapid advances in high-throughput genotyping technologies have made the genome-wide association study (GWAS) one of the most promising tools for elucidating the genetic variants underlying complex diseases and traits. Since the first lung cancer GWAS was published in 2008, more than 30 GWASs have been conducted in European and Asian populations, and a total of 81 susceptibility variants in 51 loci have been robustly associated with lung cancer risk. However, many challenges remain in elucidating the pathogenic variants in previously reported susceptibility loci. First, as most GWAS-implicated variants are tag single nucleotide polymorphisms (SNPs) selected from the
International HapMap project, which represent all co-inherited SNPs in the same haplotype, direct inference from statistically associated SNPs rarely yields functional variants. In addition, more than 90% of GWAS hits are located in non-coding regions, which also makes it difficult to distinguish functional variants from non-functional ones. Second, based on the “common disease-common variant (CD-CV)” hypothesis, microarray-based GWASs can only address how common variants predispose to complex diseases or traits; however, only a relatively small proportion (0.7%~2.4%) of lung cancer heritability can be explained by the common variants identified so far, a gap known as the “missing heritability”. Thus, two important biological questions remain to be addressed in the “post-GWAS” era: how to distinguish pathogenic variants from others in the implicated susceptibility loci of lung cancer, and how to identify rare pathogenic variants in the coding regions. Part Ⅰ. Comprehensive functional evaluation of genetic variants in susceptibility regions of non-small cell lung cancer. Background: In recent years, the emergence of multi-omics datasets, such as the Encyclopedia of DNA Elements (ENCODE) project, the Roadmap Epigenomics project, the Functional Annotation of the Mammalian Genome (FANTOM) project, the Genotype-Tissue Expression (GTEx) project, and The Cancer Genome Atlas (TCGA) project, has provided a great opportunity to unveil the biological functions of disease-associated variants. Thus, to construct a systematic functional evaluation strategy for GWAS-implicated cancer risk loci and to unveil the pathogenic variants and genes for lung cancer, we performed a comprehensive functional annotation of NSCLC susceptibility variants by integrating multiple bioinformatic algorithms with in-house and publicly available databases. Methods: To evaluate the function of all susceptibility variants, we defined a group of credible risk variants (CRVs) for lung cancer in both previously
reported risk loci and newly defined ones. For the 81 previously reported variants, we first defined index variants as those that met either of the following criteria: (1) minor allele frequency (MAF) ≥ 0.01; and (2) variants in weak linkage disequilibrium (LD, r² < 0.6) with each other. For other regions, index variants were defined if they met one or more of the following criteria in our meta-analyses: (1) MAF ≥ 0.01; (2) a genome-wide significant P value < 1×10⁻⁶ in the NSCLC, lung AD, or SqCC meta-analysis conducted in 27,120 NSCLC cases and 27,355 controls; and (3) variants in weak LD with each other and with previously reported variants (r² < 0.01). CRVs were then defined as variants in strong LD (r² ≥ 0.6) with the above two groups of index variants, in both previously reported and newly defined loci, and physically within 500 kb upstream or downstream of the index variants. To define candidate target genes for CRVs, we first performed functional annotation of CRVs in the coding regions, promoters, and enhancers, respectively, and then calculated a score for each gene-CRV pair, representing the coding impact or potential regulatory mechanism (proximal or distal gene regulation), by integrating multiple lines of evidence. Each target gene was scored based on coding sequence, proximal regulation, and distal regulation. For CRVs located in the coding regions, one score was given if: (1) the CRV was a truncation variant; (2) the CRV was a missense variant predicted to be deleterious by one of six bioinformatics algorithms (CADD, FATHMM, LRT, MutationTaster, PolyPhen-2, and SIFT); or (3) the gene was listed as a somatic driver gene for lung cancer. For the proximally regulated genes, one score was given if: (1) the CRV was located in the promoter region and overlapped with promoter-related histone modification peaks (H3K4me3 or H3K9ac); (2) the histone modification peak in which the CRV resided also intersected with transcription factor binding sites (TFBS) of transcription factors (TFs); (3) the CRV was an expression quantitative trait
locus (eQTL) for that gene; or (4) the gene was listed as a somatic driver gene for lung cancer. For the distally regulated genes, one score was given if: (1) the CRV was located in an enhancer element predicted by FANTOM5 or PreSTIGE to physically interact with the promoter of the target gene; (2) the enhancer element containing the CRV overlapped with the TFBS of one or more TFs; (3) the CRV was an eQTL for that gene; or (4) the gene was listed as a somatic driver gene for lung cancer. Additionally, two scores were given if the CRV was located in an enhancer element that physically interacted with the promoter of that gene according to Hi-C experiments. However, the score was down-weighted by a factor of 0.1 if the gene was lowly expressed in normal lung tissues and tumor/adjacent samples (expressed in less than 1% of samples). Finally, we classified candidate target genes into four levels based on the integrated scores: genes in level 1 were supported by the strongest evidence, whereas genes in level 4 were supported by weak evidence. The enhancer elements and associated promoters defined by ENCODE, FANTOM5, or PreSTIGE; the enhancer-like and promoter-like histone modification peaks; and the DNase I hypersensitive sites (DHS) and TFBS collected from ENCODE and Roadmap were used to evaluate the function of CRVs in the non-coding regions. In addition to the GTEx dataset (v7), we also performed eQTL analysis with data from our previous study and the TCGA project. Results: A total of 3,064 variants with r² > 0.6 with one of the 67 index SNPs and within 500 kb upstream or downstream of the corresponding index variants were defined as CRVs and included in the following analysis. Of the 67 index variants, 58 were defined in previously reported susceptibility loci and nine were newly defined, including variants in 2q21.3, 4p14, 4q27, 6p22.1, 8p23.1, 9q31.3, 11q23.3, 13q24, and 15q24.1. Of the 3,064 CRVs, 39 (39/3064 = 1.27%) were located in the coding regions, including 17 synonymous
variants, two nonsense variants, and 20 missense variants. Most of the defined CRVs were in the non-coding regions and showed significant enrichment in promoter-like (H3K4me3 and H3K9ac) and enhancer-like (H3K4me1 and H3K27ac) histone modification peaks, as well as in DHS regions, in normal lung tissues, lung fibroblasts, or lung cancer cell lines. By integrating multi-omics annotation data, a total of 24 genes in 20 index-variant targeted regions were categorized into level 1, 84 genes in 34 index-variant targeted regions into level 2, 394 genes in 61 index-variant targeted regions into level 3, and 218 genes in 52 index-variant targeted regions into level 4. Of these genes, 95 genes categorized as level 1 or 2 in 38 index-variant targeted regions were considered functional target genes for NSCLC, including some well-known cancer driver genes such as CASP8, BRCA2, and NRG1. Among these genes, the coding impact evaluation strategy aligned CRVs to 7 genes, the proximal regulatory gene mapping strategy matched CRVs to 37 genes, and the distal regulatory gene mapping strategy annotated CRVs to 73 genes. Further pathway enrichment analysis revealed the involvement of 26 pathways (Padj < 0.05) in the development of NSCLC, including 19 pathways related to immune function, such as the interferon gamma signaling pathway (P = 9.24×10⁻¹⁸) and the PD-1 signaling pathway (P = 5.48×10⁻¹⁵); five pathways in the neuronal system related to nicotinic acetylcholine receptors; and two pathways in the homologous recombination-related DNA repair system. Conclusions: In this study, by integrating a large-scale genome-wide meta-analysis with multiple in-house and publicly available biological data, we constructed a systematic functional evaluation strategy for GWAS-implicated variants and identified candidate pathogenic variants and genes for more than half of the susceptibility regions of lung cancer. These findings provide both a rich set of plausible gene targets
for further functional studies and novel insights into the biological underpinnings of lung cancer development. Part Ⅱ. Systematic evaluation of pathogenic germline variants in susceptibility genes of non-small cell lung cancer and cancer predisposition genes by whole-genome sequencing. Background: In recent years, with rapid advances in next-generation sequencing technology, increasing evidence has suggested that rare germline variants also play crucial roles in the development of complex diseases. Although the coding regions are well-known functional regions, most of the CRVs (3025/3064 = 98.73%) defined for NSCLC in Part I were located in the non-coding regions; thus, it remains largely unknown whether rare variants in the coding regions of susceptibility genes contribute to the risk of lung cancer. In addition, previous family-based studies have defined a group of well-established cancer predisposition genes (CPGs), in which pathogenic variants in the coding regions are often observed among cancer patients. However, no study has evaluated the association of pathogenic germline variants in these genes with lung cancer risk. Thus, using whole-genome sequencing (WGS) and functional annotation, we comprehensively evaluated the effect on lung cancer risk of rare variants in the coding regions of both lung cancer susceptibility genes and well-established CPGs. Methods: Whole-genome sequencing was performed on whole-blood-derived DNA from 1,473 Chinese NSCLC patients and 1,488 non-cancer controls. A systematic quality control (QC) process was performed to exclude samples with contamination, low coverage, low mapping quality, gender discrepancy, unexpected duplicates or probable relatives, or a high heterozygosity rate. Germline single nucleotide variants (SNVs) and small insertions and deletions (indels) were detected with the Genome Analysis Toolkit (GATK, v3.8) following best practices. Variants were filtered out if they met one of
the following QC criteria: (1) call rate < 95%; (2) Hardy-Weinberg P < 1×10⁻⁴; (3) heterozygote rate < 0.6; and (4) minimum coverage ≤ 15X. Candidate putative loss-of-function (LoF) variants (nonsense, frameshift, and splice site) in the 95 lung cancer susceptibility genes defined in Part I and in 152 well-established CPGs, of which four were also defined as lung cancer susceptibility genes, with an estimated minor allele frequency ≥ 0.05% in the overall cohort and ≥ 0.05% in the Asian or the whole populations in the Genome Aggregation Database (gnomAD) non-cancer set (v2.1.1), were then subjected to further functional evaluation. Variants from other categories (missense, intronic, or splice region) that were classified as pathogenic/likely pathogenic (P/LP) in ClinVar with two or more stars were subjected to similar functional evaluation criteria. All LoF and ClinVar P/LP variants identified above were merged into a set of “pathogenic variants”. Differences in the prevalence of carriers of pathogenic variants, in all CPGs or in each single gene, between NSCLC patients and non-cancer controls were assessed using Fisher’s exact test. The analyses were also stratified by age at DNA sampling, sex, smoking status, and histological type. A P value ≤ 0.05 for Cochran’s Q statistic was considered to indicate heterogeneity. Results: For the 95 lung cancer susceptibility genes defined in Part I, no significant difference in the rate of variant carriers between NSCLC patients (6.11%) and non-cancer controls (5.58%) was observed (OR = 1.12, 95% CI = 0.82-1.52, P = 0.50). For the 152 well-established CPGs, a total of 206 pathogenic variants in 66 CPGs were identified in 192 NSCLC patients (13.03%), including 23 missense variants, 78 frameshift variants, 58 nonsense variants, 43 splice site variants, and 4 variants in the splicing regions. The most frequently altered CPGs in NSCLC patients were SLC25A13 (n = 17), SBDS (n = 15), GJB2 (n = 13), ATM (n = 11), BRCA2 (n = 11), ATR (n = 6), COL7A1 (n = 6), MUTYH (n = 6), PTEN (n = 6), BRCA1 (n = 5), FANCA (n = 5), RAD51D (n = 5), and TSHR (n = 5). In controls, we detected 139 pathogenic variants in 136
samples (9.14%). The proportion of CPG variant carriers was significantly higher among NSCLC patients than among non-cancer controls (OR = 1.53, P = 1.55×10⁻⁴). When stratified by age at DNA sampling and gender, the association between carrying CPG variants and NSCLC risk was significantly stronger among females than males (OR = 2.03 vs. 1.22, Phet = 0.04), and among those aged ≤ 60 than those aged > 60 (OR = 2.20 vs. 1.24, Phet = 0.02). Of the 25 genes with more than two carriers among NSCLC patients, ATM (OR = 11.14, P = 3.25×10⁻³) and BRCA2 (OR = 3.71, P = 0.03) showed significantly higher mutation rates among NSCLC patients than controls, while FANCA, FANCM, POLD1, TP53, POT, and WRN were mutated only in NSCLC patients. Conclusions: In this study, we provided evidence that the rate of carriers of rare functional variants in well-established CPGs was significantly higher among NSCLC patients than among non-cancer controls, whereas rare functional variants identified in susceptibility genes showed no significant association with NSCLC risk. These findings suggest that rare variants are also among the major genetic risk factors for lung cancer, and that the genetic mechanisms by which rare variants contribute to lung cancer differ from those of common variants. In addition, we showed that WGS can efficiently identify pathogenic variants for lung cancer, which could help to better understand the genetic mechanisms of lung carcinogenesis. However, since there are no established criteria for evaluating the function of non-coding variants, whether rare pathogenic variants in the non-coding regions contribute to the risk of lung cancer remains to be further analyzed. |
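
The distal-regulation scoring rule described in the Part Ⅰ Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and field names are hypothetical, and the sketch assumes that each satisfied criterion contributes one point and that the points are summed, which the abstract implies but does not state outright. The thresholds separating levels 1 to 4 are not given in the abstract and are therefore not modeled.

```python
from dataclasses import dataclass


@dataclass
class DistalEvidence:
    """Evidence flags for one gene-CRV pair (hypothetical field names)."""
    predicted_interaction: bool   # enhancer-promoter link predicted by FANTOM5 or PreSTIGE
    enhancer_overlaps_tfbs: bool  # enhancer containing the CRV overlaps a TFBS
    is_eqtl: bool                 # CRV is an eQTL for the gene
    is_driver_gene: bool          # gene is a somatic driver gene for lung cancer
    hic_interaction: bool         # enhancer-promoter contact supported by Hi-C
    lowly_expressed: bool         # gene expressed in less than 1% of samples


def distal_score(ev: DistalEvidence) -> float:
    """Score one gene-CRV pair under the distal-regulation rules."""
    # Assumption: one point per satisfied criterion, summed.
    score = float(sum([
        ev.predicted_interaction,
        ev.enhancer_overlaps_tfbs,
        ev.is_eqtl,
        ev.is_driver_gene,
    ]))
    # Hi-C support is worth two points.
    if ev.hic_interaction:
        score += 2.0
    # Lowly expressed genes are down-weighted by a factor of 0.1.
    if ev.lowly_expressed:
        score *= 0.1
    return score


if __name__ == "__main__":
    example = DistalEvidence(
        predicted_interaction=True,
        enhancer_overlaps_tfbs=True,
        is_eqtl=False,
        is_driver_gene=False,
        hic_interaction=True,
        lowly_expressed=False,
    )
    print(distal_score(example))
```

The coding and proximal scores would follow the same pattern with their own criteria. Keeping the down-weighting as the last step means it scales the Hi-C contribution as well, which is one reading of the description.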