Association Studies On Missing Heritability Of Complex Phenotypes

Posted on: 2020-05-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Guo
GTID: 1360330590972770
Subject: Computer Science and Technology
Abstract/Summary:
Complex traits, which are controlled jointly by multiple genes and environmental factors, have always been an important and challenging research topic in genetics. The study of these traits has important applications in the prevention, diagnosis, and treatment of complex human diseases, as well as in the selective breeding of crops. Since the sequencing of the human genome and of the genomes of many common plants and animals, genome-wide association studies (GWAS) have been successfully applied to many complex phenotypic traits and diseases, and have become one of the most important tools for studying the relationship between genes and complex traits. However, despite major successes in uncovering the genes underlying complex human diseases, for most complex traits GWAS can only detect genetic variants that account for between 5% and 30% of the variance attributed to genetic factors. This is called the "missing heritability problem". Based on current research, potential causes of this problem include: 1) insufficient statistical power of single-locus analysis; 2) lack of consideration of gene-gene and gene-environment interactions; 3) insufficient study of the effect of rare genetic variants. Using single nucleotide polymorphism (SNP) data, we focus on issues 1) and 2) above and propose four new strategies for multi-locus modeling and for the modeling of epistasis and gene-gene interactions:

(1) Multi-locus association analysis via combining linear mixed model and sparse group Lasso
To address the limited statistical power of traditional single-locus GWAS, we propose a method for multi-locus association studies based on the linear mixed model and sparse group lasso. Firstly, single-locus GWAS faces a stringent significance threshold imposed by the unavoidable multiple-testing correction and cannot exploit connections between loci, so we use a multivariate linear model that combines information from multiple loci and thereby strengthens statistical power. Secondly, to deal with potential false positives caused by confounding factors such as population structure, we use the linear mixed model and represent the confounding factors as random effects. Lastly, based on the biological intuition that a phenotype should be related to only a few SNPs in a few genes, we incorporate prior knowledge of SNP locations via sparse group lasso. Experiments on both simulated and real data show that the new multi-locus method performs well in both phenotype prediction and the selection of effective loci, and is a powerful tool for association studies.

(2) Epistasis detection based on factorization machines
The study of interactions between SNPs (epistasis) is an important topic in GWAS. Among current methods, statistically testing the interactions of all pairs of SNPs has a time complexity that grows quadratically with the number of SNPs and may have insufficient statistical power because of multiple-testing correction, while randomized or heuristic search methods may fail to find all epistatic interactions. A major problem in the study of epistasis is therefore how to reduce complexity while still considering the interactions of all possible combinations of SNPs. In this work we propose an epistasis detection method based on factorization machines: first, the SNP data are one-hot encoded into sparse features; these sparse features are then used as the input of a factorization machine that learns embedding vectors; finally, the inner product of the embedding vectors represents the strength of the interaction between a pair of SNPs. Our experiments show that, compared with prior approaches, this method detects epistasis more accurately and efficiently.

(3) Gene-based nonparametric testing of interaction underlying qualitative traits
Because marker-level interaction studies may have high time complexity and low statistical power, interaction between genes (i.e. groups of SNPs in the same gene treated as a single feature) has recently become another popular topic in GWAS. Here we propose a new strategy for detecting gene-gene interactions based on a permutation strategy and distance correlation (dcor). Firstly, because dcor has strong power for detecting nonlinear dependence and places no constraint on the dimensions of the two features, we use the difference between the dcor in disease samples and the dcor in control samples to characterize the strength of the interaction. Since measuring interaction via the difference of dcor makes few assumptions about its exact form, the method should generalize well. Furthermore, to assess the significance of this statistic, we use a permutation strategy to estimate its distribution in the absence of interactions. Experiments on 8 simulated disease models and on real human rheumatoid arthritis (RA) data show that the method has significant advantages over prior approaches for detecting gene-gene interactions.

(4) Gene-based testing of interaction underlying quantitative traits
As opposed to qualitative traits, quantitative traits are those that take continuous values among the individuals in a group. Lipid levels in humans, flowering time in plants, and grain weight are all quantitative traits. Investigating the genetic variation related to human lipid levels is crucial for understanding the pathogenesis of cardiovascular and cerebrovascular diseases, and the flowering time and grain weight of plants are closely related to the breeding of elite varieties. However, research on genetic interaction underlying quantitative traits is currently limited. We therefore propose a test based on U-statistics and ensemble learning for the interaction of genes underlying quantitative traits. Firstly, to model the nonlinear relationships in gene-gene interaction, we adopt an ensemble learning method as the learning algorithm. To capture the different forms of interaction while retaining good generalization performance, we choose regression trees as the base learners. Secondly, a special subsampling scheme ensures that the prediction made by the ensemble is a U-statistic, so that its approximate normality can be used to design a test statistic that measures the strength of gene-gene interaction. Experimental results show that the proposed statistical test effectively detects different forms of gene-gene interaction underlying quantitative traits.
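The sparse group lasso used in strategy (1) can be illustrated with a small sketch. The abstract does not give the exact objective or solver, so the code below assumes the commonly used formulation: an L1 term on every SNP coefficient plus a per-gene L2 term weighted by the square root of the group size. The function names, the parameters `lam_l1` and `lam_group`, and the group encoding are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def sparse_group_lasso_penalty(beta, groups, lam_l1, lam_group):
    """Assumed penalty: lam_l1 * ||beta||_1 + lam_group * sum_g sqrt(p_g) * ||beta_g||_2."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    total = lam_l1 * np.abs(beta).sum()
    for g in np.unique(groups):
        block = beta[groups == g]  # coefficients of the SNPs in gene g
        total += lam_group * np.sqrt(block.size) * np.linalg.norm(block)
    return float(total)

def sparse_group_lasso_prox(z, groups, lam_l1, lam_group, step=1.0):
    """Proximal operator of the penalty above, scaled by `step`.

    Element-wise soft-thresholding (sparsity within genes) followed by
    group-wise shrinkage (whole genes can be set to zero).
    """
    z = np.asarray(z, dtype=float)
    groups = np.asarray(groups)
    soft = np.sign(z) * np.maximum(np.abs(z) - step * lam_l1, 0.0)
    out = np.zeros_like(soft)
    for g in np.unique(groups):
        mask = groups == g
        norm = np.linalg.norm(soft[mask])
        thresh = step * lam_group * np.sqrt(mask.sum())
        if norm > thresh:  # otherwise the whole gene stays at zero
            out[mask] = soft[mask] * (1.0 - thresh / norm)
    return out
```

In a proximal-gradient solver this operator would be applied after each gradient step on the model's data-fit term; that outer loop, and how the random effects of the linear mixed model enter it, are not specified in the abstract and are omitted here.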
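The pipeline described in strategy (2), one-hot coding followed by a factorization machine whose embedding inner products score SNP pairs, can be sketched as follows. This is a minimal illustration, not the author's code: training is omitted, genotypes are assumed to be coded 0/1/2, and the way genotype-level inner products are aggregated into one score per SNP pair (maximum absolute value) is an assumption made here for the example.

```python
import numpy as np

def one_hot_genotypes(genotypes):
    """Encode an (n_samples, n_snps) matrix of 0/1/2 genotypes as
    (n_samples, 3 * n_snps) indicator features."""
    g = np.asarray(genotypes, dtype=int)
    n, m = g.shape
    x = np.zeros((n, 3 * m))
    rows = np.repeat(np.arange(n), m)
    cols = (3 * np.arange(m) + g).ravel()  # column of SNP j with genotype a is 3*j + a
    x[rows, cols] = 1.0
    return x

def fm_predict(x, w0, w, v):
    """Second-order factorization machine output:
    w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    with the pairwise term computed in O(n_features * k) per sample."""
    xv = x @ v                    # (n_samples, k)
    x2v2 = (x ** 2) @ (v ** 2)    # (n_samples, k)
    pairwise = 0.5 * (xv ** 2 - x2v2).sum(axis=1)
    return w0 + x @ w + pairwise

def snp_interaction_scores(v, n_snps):
    """Score each SNP pair from the embeddings v of shape (3 * n_snps, k).

    Assumption: the pair score is the largest absolute inner product
    over the 3 x 3 genotype-level feature pairs of the two SNPs.
    """
    feat = np.abs(v @ v.T).reshape(n_snps, 3, n_snps, 3)
    scores = feat.max(axis=(1, 3))
    np.fill_diagonal(scores, 0.0)  # a SNP is not scored against itself
    return scores
```

The reformulated pairwise term is what avoids enumerating all feature pairs during prediction; after the embeddings `v` have been fitted to the phenotype, the highest entries of `snp_interaction_scores(v, n_snps)` would be the candidate epistatic pairs.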
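The statistic in strategy (3), the difference between the distance correlation of two genes in cases and in controls, can be sketched with NumPy alone. The abstract does not say what is permuted, so this sketch assumes the case/control labels are shuffled to build the null distribution; the function names, `n_perm`, and `seed` are illustrative.

```python
import numpy as np

def _centered_distances(x):
    """Double-centered pairwise Euclidean distance matrix of the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcor(x, y):
    """Sample distance correlation between x and y (same number of rows,
    any number of columns). Returns 0.0 if either input is constant."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max((a * b).mean(), 0.0) / denom))

def dcor_difference(gene_a, gene_b, is_case):
    """dcor of the two genes among cases minus dcor among controls."""
    return (dcor(gene_a[is_case], gene_b[is_case])
            - dcor(gene_a[~is_case], gene_b[~is_case]))

def permutation_p_value(gene_a, gene_b, is_case, n_perm=200, seed=0):
    """Two-sided permutation p-value, shuffling case/control labels (assumed null)."""
    rng = np.random.default_rng(seed)
    observed = abs(dcor_difference(gene_a, gene_b, is_case))
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(is_case)
        if abs(dcor_difference(gene_a, gene_b, shuffled)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Here `gene_a` and `gene_b` are genotype matrices of shape (n_samples, n_snps_in_gene) and `is_case` is a boolean NumPy array. Because dcor accepts features of different dimensions, the two genes may contain different numbers of SNPs.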
Keywords/Search Tags: genome-wide association study, single nucleotide polymorphism, missing heritability, multi-locus association analysis, epistasis, gene-gene interaction