
Probabilistic models in computational molecular biology applied to the identification of mobile genetic elements and gene finding

Posted on: 2010-08-23
Degree: Ph.D
Type: Thesis
University: Indiana University
Candidate: Rho, Mina
Full Text: PDF
GTID: 2440390002475584
Subject: Biology
Abstract/Summary:
Advances in sequencing technology are producing an enormous amount of sequence data that need to be analyzed. Efficient computational methods assist such efforts by identifying meaningful patterns encoded in genomic sequences and by complementing experimental efforts to establish structure-function relationships in biological systems. In this thesis, I describe new computational frameworks to identify mobile genetic elements (MGEs) and genes in DNA sequences.

MGEs are found in most eukaryotic genomes and constitute a significant portion of those genomes. Accordingly, the identification of MGEs and the analysis of their dynamics are important for a better understanding of the structure and evolution of both the host genomes and the MGEs themselves. We have developed MGEScan-LTR, a de novo method to identify LTR retrotransposons, an important class of MGEs that transpose through reverse transcription of RNA intermediates. As detailed in Chapter 2 of this thesis, MGEScan-LTR first identifies intact LTR retrotransposons by finding two highly similar subsequences separated by a certain distance in a given genomic sequence. In the next step, MGEScan-LTR identifies solo LTRs by first clustering the LTRs identified in the previous step, and then searching the whole genome using profile hidden Markov models (PHMMs) built from these LTR sequence clusters. These frameworks were applied to identify a large number of novel elements, which were subsequently analyzed to estimate the evolutionary history and relationships of MGEs.

Chapter 3 of this thesis describes MGEScan-nonLTR, a computational approach inspired by a generalized hidden Markov model (GHMM) to identify non-LTR retrotransposons in genomic sequences. In comparative studies using genome sequences from four eukaryotic organisms, MGEScan-nonLTR found a significantly larger number of elements than RepeatMasker run with the current version of the RepBase Update library. We also identified novel elements in two other genomes, which have been only partially studied for non-LTR retrotransposons.

In metagenomics, gene finding can provide the opportunity to elucidate the activities and interactions of genes within environmental samples. Such efforts can also assist the reconstruction of metabolic and signaling pathways that are specific to those environments. Chapter 4 of this thesis describes MetaGeneScan, a method that we developed using hidden Markov models of genes and non-coding regions in prokaryotic genomes. In particular, we incorporated models of sequencing errors to significantly improve the handling of frameshift errors, allowing for insertion and deletion states between consecutive phases in match states. As a result, MetaGeneScan performed better on simulated sequencing reads than existing methods such as MetaGene.
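The first step attributed to MGEScan-LTR above, finding two highly similar subsequences separated by a certain distance, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes exact-match k-mer seeds extended without mismatches, whereas a real tool would tolerate divergence between the two LTR copies. The function name `find_candidate_ltr_pairs` and all parameters (`k`, `min_len`, `min_dist`, `max_dist`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_candidate_ltr_pairs(seq, k=8, min_len=12, min_dist=25, max_dist=200):
    """Return sorted (left_start, right_start, length) tuples for exact
    repeats whose start positions are min_dist..max_dist apart.

    Illustrative assumption: copies must match exactly; real LTR pairs
    usually differ by a few substitutions or indels.
    """
    # Index every k-mer by the positions where it starts.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    hits = set()
    for starts in positions.values():
        for idx, a in enumerate(starts):
            for b in starts[idx + 1:]:
                dist = b - a
                if dist < min_dist or dist > max_dist:
                    continue
                # Extend the seed to the left while both copies agree.
                left = 0
                while a - left - 1 >= 0 and seq[a - left - 1] == seq[b - left - 1]:
                    left += 1
                # Extend the seed to the right while both copies agree.
                right = k
                while b + right < len(seq) and seq[a + right] == seq[b + right]:
                    right += 1
                # Cap the length so the two copies never overlap.
                length = min(left + right, dist)
                if length >= min_len:
                    hits.add((a - left, b - left, length))
    return sorted(hits)


# Toy sequence: flank + LTR + internal region + LTR + flank.
ltr = "ACGTTGCAAGCTTAGGCTAC"
toy = "TTTTT" + ltr + "G" * 15 + "C" * 15 + ltr + "AAAAA"
print(find_candidate_ltr_pairs(toy))  # [(5, 55, 20)]
```

In the toy sequence the two 20-base copies start at positions 5 and 55, so the sketch reports a single candidate pair. Every seed inside the same repeat extends to the same maximal match, which is why the hits are collected in a set.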
Keywords/Search Tags: Computational, Elements, Sequencing, Models