Sparse Model Learning for Identifying Nucleotide Motifs and Inferring Genotype and Phenotype Associations

Posted on:2017-11-07

Degree:Ph.D

Type:Thesis

University:University of Miami

Candidate:Kuruppu Appuhamilage, Indika P

Full Text:PDF

GTID:2454390005998504

Subject:Bioinformatics

Abstract/Summary:

The primary function of gene expression is to convert information stored in genes into gene products such as RNAs or proteins. This complex process is fundamentally controlled by a class of proteins known as transcription factors (TFs) that bind to specific locations on the DNA double helix. These binding sites, known as transcription factor binding sites (TFBSs), are generally short motifs of 6-20 base pairs. The discovery of new TFBSs will contribute to the establishment of gene regulation networks, the diagnosis of genetic diseases and new drug design.

On the other hand, the genotype/phenotype relationship is mainly explained by multiple quantitative trait loci (QTLs), epistatic effects and environmental factors. A QTL is a section of DNA that correlates with variation in a phenotype. The QTL typically is linked to, or contains, the genes that control that phenotype. Interactions among QTLs, or between genes and environmental factors, contribute substantially to variation in complex traits. During the last two decades, the use of QTLs in animal and plant breeding has proven effective for increasing food production, improving resistance to diseases and pests, improving tolerance to heat, cold and drought, and improving nutrient content.

Therefore, the objective of this dissertation is to develop sparse models for such high-dimensional data, develop accurate sparse variable selection and estimation algorithms for the models, and design statistical methods for robust hypothesis tests for the TFBS identification and QTL mapping problems.
Although the sparse model learning methods presented in this thesis are used in the context of TFBS identification or QTL mapping, the algorithms are equally applicable to a broad range of problems, such as whole-genome QTL mapping and pathway-based genome-wide association studies (GWAS).

The widely used computational methods for identifying TFBSs based on the position weight matrix (PWM) assume that the nucleotides at different positions of a TFBS are independent. However, several experimental results demonstrate dependencies among different positions. Recently, Bayesian networks (BN) and variable order Bayesian networks (VOBN) were proposed to model such dependencies and thereby improve the accuracy of predicting TFBSs. However, BN and VOBN model the dependencies in a directional manner, which may hinder their ability to fully capture complex dependencies. To this end, we develop a Markov random field (MRF) based model for TFBSs capable of capturing complex undirected relationships among motifs. To capture the large extent of dependencies in a sparse model without causing overfitting, we develop a feature selection method that carefully chooses only the most relevant features of the model.

An exhaustive simulation study affirmed that our MRF-based method outperforms other state-of-the-art methods based on VOBN. To further reduce the computational complexity of our algorithm, we introduce a novel pairwise MRF model for TFBSs, and develop a fast algorithm to learn the model parameters.
Specifically, we adopt an optimization method that employs the log-determinant relaxation approach to evaluate the partition function in the MRF, which dramatically reduces the computational complexity of the algorithm.

For the genotype/phenotype association problem, we develop a novel empirical Bayesian least absolute shrinkage and selection operator (EBlasso) algorithm with normal and exponential (NE) and normal, exponential and gamma (NEG) hierarchical prior distributions. Both of these algorithms employ a novel proximal gradient approach to simultaneously estimate model parameters, which leads to extremely fast convergence. Furthermore, we develop a novel proximal gradient hybrid model capable of detecting more QTLs than the basic version while still maintaining a lower false positive rate.

With both the covariance and the posterior modes estimated, the algorithms also provide a statistical testing method that considers as much information as possible without increasing the degrees of freedom (DF). Extensive simulation studies are carried out to evaluate the performance of the proposed methods, and real datasets are analyzed for validation. Both simulation and real data analyses suggest that the new methods are fast and accurate genotype-phenotype association methods that can easily handle high-dimensional data, including possible main and interaction effects, and are orders of magnitude faster than existing state-of-the-art methods. Specifically, with EBlasso-NEG, our new algorithm could easily handle more than 10^5 possible effects within a few seconds on an average personal computer.

Given the fundamental importance of gene expression and genotype/phenotype associations in understanding the genetic basis of complex biological systems, the MRF, pairwise-MRF, EBlasso-NE, EBlasso-NEG and EBlasso-NEG hybrid algorithms and software packages developed in this dissertation achieve the effectiveness, robustness and efficiency needed for successful application to biology.
With the advancement of high-throughput molecular technologies in generating information at genetic, epigenetic, transcriptional and posttranscriptional levels, the methods developed here have broad applications for inferring TFBSs and different types of genotype-phenotype associations. (Abstract shortened by UMI.)
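
As a rough illustration of the proximal gradient idea mentioned for the association problem, the sketch below fits an ordinary lasso regression by iterative soft-thresholding. It is not the EBlasso-NE or EBlasso-NEG algorithm: the hierarchical priors, covariance estimates and hypothesis tests of the dissertation are not reproduced. The function names (soft_threshold, lasso_ista, toy_design), the -1/+1 marker coding, and the values of lam and step are all assumptions made for this example.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value|: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_ista(X, y, lam, step, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        # Gradient of the smooth least-squares part at the current estimate.
        residual = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(p)]
        # Gradient step, then soft-thresholding, which sets small effects to exactly zero.
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam) for j in range(p)]
    return beta


def toy_design():
    """Hypothetical 8 individuals x 4 markers, coded -1/+1, with orthogonal columns."""
    return [[1 if bin(i & j).count("1") % 2 == 0 else -1 for j in range(1, 5)]
            for i in range(8)]


if __name__ == "__main__":
    X = toy_design()
    # Simulated phenotype: only markers 1 and 3 (0-based) have an effect.
    y = [2.0 * row[1] - 1.5 * row[3] for row in X]
    print(lasso_ista(X, y, lam=0.5, step=0.5))
```

Because the toy marker columns are orthogonal, the two fitted effects should approach 1.5 and -1.0 (the simulated effects 2.0 and -1.5 shrunk toward zero by lam), while the two markers with no effect stay at zero. This is the sparsity property that makes lasso-type penalties attractive when the number of candidate effects is large.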

Keywords/Search Tags:

Model, TFBSs, Phenotype, Association, QTL mapping, Develop, Motifs, Methods

Related items

1	The Research Of Statistical Methods Based On Entropy Theory For QTL Mapping In Human Beings
2	FECG Extraction Methods Based On Non-linear Estimation Combined With BSS
3	A mixed methods study to develop an instrument to assess family quality of life among caregivers of adults with traumatic brain injury
4	Methods for detecting multi-locus genotype-phenotype association
5	Aedes aegypti vector competence and gene flow in Mexico. Association mapping software for testing candidates genes associated with a phenotype
6	Cost-utility Analysis Of Medication Regimens For Tic Disorders In Children And Adolescents Based On Scales Mapping Methods
7	The Study, Based On The Technique Of Epicardial Mapping Of Atrial Fibrillation Characterization Methods And Electrical Physiological Mechanism
8	Study Of The Role Of Methylation Of CpG Motifs And Its Regulation Factors In The Pathogenesis Of Systemic Lupus Erythematosus
9	Single nucleotide polymorphism (SNP) selection for genotype-phenotype association studies
10	Statistical methods for family-based association studies for complex human diseases: Single-locus and haplotype methods