
Pipeline Design And Application Based On Rare Variant Association Analysis

Posted on: 2022-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: R J Li
Full Text: PDF
GTID: 2480306323965189
Subject: Data Science
Abstract/Summary:
With the development of DNA sequencing technology, genome-wide association studies (GWASs) have successfully identified more than two thousand pathogenic genetic loci, but the common variants found by GWASs explain only a small part of the heritability of complex traits. Theoretical and empirical studies have shown that the unexplained heritability is likely to be contributed by rare variants with larger effects. In recent years, researchers have proposed association test methods to improve the power of tests for rare variants, and have also developed software dedicated to rare variants. However, few studies provide a systematic pipeline for identifying rare variants associated with disease.

In this thesis, an Effective Gene-based Rare Variant Association analysis (EGRVA) pipeline is proposed to identify rare variants related to disease in case-control studies. The pipeline takes GWAS data as input and performs genotype imputation, quality control, functional annotation, statistical analysis, and bioinformatics analysis in succession. In the genotype imputation stage, we use the large Haplotype Reference Consortium (HRC) panel to infer genotypes in order to improve imputation accuracy. In the functional annotation stage, the EGRVA pipeline implements two types of variant annotation: gene-based annotation, which selects exonic and splicing variants, and annotation based on the LJB* (dbNSFP) database, which predicts deleteriousness scores for non-synonymous variants. We then apply the Efficient Resampling Sequence Kernel Association Test and a Bayesian mixture model, respectively, to identify risk genes under the two annotation methods.

We have successfully applied the EGRVA pipeline to the GPN-PBR dataset from a spontaneous preterm birth study and the ADNI-1 dataset from an Alzheimer's disease study, and identified four potential pathogenic genes (HRNR, PMS1, ATM, SLC22A25) for spontaneous preterm birth and one gene, SIPA1L2, for Alzheimer's disease. Through a series of bioinformatics analyses, we obtain biological explanations of the risk genes and predict their pathogenic pathways. The bioinformatics results support the effectiveness of the EGRVA pipeline. The successful application of this pipeline to the two datasets provides a valuable reference for the discovery of rare genetic loci in other complex diseases. EGRVA can be obtained from https://github.com/ruijiali/EGRVA.
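
The gene-based annotation step described above (keeping rare exonic and splicing variants and grouping them by gene before testing) can be illustrated with a minimal Python sketch. This is not the EGRVA code and not the Efficient Resampling SKAT or the Bayesian mixture model. The `Variant` record, the function names, the 0.01 minor allele frequency cutoff, and the toy gene and sample labels are all assumptions made for illustration. The per-gene carrier counts at the end only show the kind of gene-level input an association test would receive.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    variant_id: str
    gene: str
    region: str  # annotation label, e.g. "exonic", "splicing", "intronic"
    maf: float   # minor allele frequency


# Assumed cutoff and region labels; the abstract does not state them.
MAF_THRESHOLD = 0.01
KEPT_REGIONS = {"exonic", "splicing"}


def select_rare_coding(variants, maf_threshold=MAF_THRESHOLD):
    """Keep variants that are rare and annotated as exonic or splicing."""
    return [
        v for v in variants
        if v.maf < maf_threshold and v.region in KEPT_REGIONS
    ]


def group_by_gene(variants):
    """Collect the selected variant IDs into one set per gene."""
    genes = defaultdict(list)
    for v in variants:
        genes[v.gene].append(v.variant_id)
    return dict(genes)


def carrier_counts(gene_sets, genotypes, is_case):
    """Count cases and controls carrying at least one selected variant per gene.

    genotypes: sample -> {variant_id: minor allele count}
    is_case:   sample -> True for cases, False for controls
    """
    counts = {}
    for gene, ids in gene_sets.items():
        cases = controls = 0
        for sample, calls in genotypes.items():
            if any(calls.get(vid, 0) > 0 for vid in ids):
                if is_case[sample]:
                    cases += 1
                else:
                    controls += 1
        counts[gene] = (cases, controls)
    return counts


# Toy data with hypothetical genes and samples (not from the thesis).
variants = [
    Variant("v1", "GENE_A", "exonic", 0.004),
    Variant("v2", "GENE_A", "splicing", 0.002),
    Variant("v3", "GENE_A", "intronic", 0.001),  # dropped: not exonic/splicing
    Variant("v4", "GENE_B", "exonic", 0.2),      # dropped: common variant
    Variant("v5", "GENE_B", "exonic", 0.005),
]
genotypes = {
    "s1": {"v1": 1},
    "s2": {"v2": 1, "v5": 1},
    "s3": {"v3": 1},
    "s4": {"v5": 2},
}
is_case = {"s1": True, "s2": True, "s3": False, "s4": False}

gene_sets = group_by_gene(select_rare_coding(variants))
counts = carrier_counts(gene_sets, genotypes, is_case)
print(counts)
```

In a real analysis the gene sets would feed a dedicated rare-variant test rather than raw carrier counts. The sketch only shows how filtering and grouping turn per-variant annotations into gene-level units.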
Keywords/Search Tags: Rare variant, Pipeline design, Association analysis, Bioinformatics