| Deciphering how genetic variants affect phenotypes and human diseases is a key challenge in genetics. Genome-wide association studies (GWAS) have detected thousands of single-nucleotide polymorphisms (SNPs) associated with complex phenotypes and diseases. However, current GWAS still have several limitations: i) GWAS cannot pinpoint causal variants; ii) noncoding SNPs cannot be well interpreted; iii) linking genotype to phenotype at the level of molecular mechanism is nearly impossible when relying on GWAS alone. In this study, we integrated functional genomic data and used machine learning to annotate SNPs in Arabidopsis thaliana (Arabidopsis) and Zea mays (Maize) at the protein, RNA, and DNA levels. By aggregating the functional effects of SNPs across these levels, we performed gene-level association analyses for flowering time-related phenotypes in Arabidopsis and Maize and identified hundreds of candidate flowering time-related genes. Functional SNP annotation at different levels can provide new insights for linking genotype to phenotype and understanding the genetic mechanisms of complex traits. The main results are summarized as follows. Amino acid sequence-based functional SNP annotation at the protein level. Based on homologous amino acid sequence differences and deep representation learning, we identified 340,881 and 39,244 high-confidence deleterious variants in the Arabidopsis 1001 Genomes and Maize AMP (Association Mapping Panel) populations, respectively. Population-level analysis showed that deleterious variants undergo stronger negative selection than tolerant variants. By integrating protein-level functional annotation with GWAS signals, we identified causal variants and genes involved in important biological processes. Translation initiation site-based functional SNP annotation at the RNA level. High-quality translation initiation site (TIS) data for Arabidopsis and Maize were constructed by integrating ribosome profiling data (including Ribo-Seq and QTI-Seq). Subsequently, the deepTIS model was
built from these high-quality TIS data using a deep neural network. deepTIS achieved high accuracy on a transcript-level measure: 95.2% and 84.0% of transcripts in Arabidopsis and Maize, respectively, were accurately predicted. Based on deepTIS, we identified 30,278 and 17,250 SNPs associated with TISs and discovered SNPs that would affect the formation of upstream ORFs. deepTIS can thus be used to infer links between SNPs and phenotypes at the TIS level. m6A modification-based functional SNP annotation at the RNA level. deepEA was constructed from m6A modifications identified in m6A-Seq data using the random forest algorithm. deepEA achieved high performance, with AUCs of 0.969 and 0.954 in Arabidopsis and Maize, respectively. Based on deepEA, we identified 556 and 2,504 SNPs that would affect m6A modification; these functional SNPs can link genotype to phenotype at the RNA modification level. By integrating biological network analysis, we identified several functionally important m6A-related genes. Transcription factor binding site-based functional SNP annotation at the DNA level. By integrating large-scale transcription factor binding site (TFBS) profiles (including ChIP-Seq and DAP-Seq), we constructed deepTFBS, a deep neural network-based multi-label TFBS prediction model. deepTFBS outperformed state-of-the-art machine learning methods when evaluated in Arabidopsis and Maize. Based on deepTFBS, we identified 852,904 and 9,704 SNPs that would affect TF binding ability. Integrating these SNPs with GWAS and eQTL results can reveal the relationships among SNPs, TF binding, gene expression, and phenotype, which in turn will contribute to understanding the genetic mechanisms of complex traits. Gene-level association analysis integrating the functional effects of SNPs. Compared with traditional GWAS, gene-level association analysis can directly identify genes associated with phenotypes. By integrating the functional effects of SNPs at the protein, RNA, and DNA levels, we performed gene-level association analysis with a mixed linear model and identified 176 and 30
candidate flowering time-related genes in Arabidopsis and Maize, respectively. Several of these candidate genes in Arabidopsis and Maize have been experimentally verified to be associated with flowering time. These results indicate that gene-level association analysis is beneficial for mining and identifying key functional genes and causal variants of complex traits. In summary, we performed systematic and comprehensive functional SNP annotation for both coding and noncoding regions; the functional effects of SNPs at different levels were deposited in the FunSNPDB database (http://funsnpdb.omstudio.cloud). The bioinformatics methods developed in this study can facilitate research on the genetic mechanisms of complex traits. |