
Integrating experimental high-throughput transcript detection data into probabilistic gene finding

Posted on: 2010-01-24
Degree: Ph.D.
Type: Dissertation
University: Washington University in St. Louis
Candidate: Tenney, Aaron
Full Text: PDF
GTID: 1448390002971898
Subject: Biology
Abstract/Summary:
After determining an organism's genome sequence, one of the most useful tasks is to search the genome for protein-coding genes. A complete and accurate catalog of every variant of each protein-coding gene (the transcriptome) allows biologists to understand how the organism's macro-scale traits (phenotypes) are related to its genome sequence. Finding genes in whole genomes, however, is difficult. Eukaryotic genomes are huge sequences, consisting of hundreds of millions to billions of nucleotides. The fraction of this sequence that defines protein-coding genes is very small (for example, ∼25% in fly and only ∼3% in human) and is scattered throughout the genome in short (∼100-200 nucleotide) subsequences called exons. In the last 10 years, great advances have been made in solving the computational problem of finding genes in whole-genome sequence (Brent 2008). These methods make use of evidence from the genome sequence itself, as well as homology information from related genomes and direct evidence of transcription in the form of aligned cDNA sequences.

Recently, two powerful new experimental methods for directly querying the transcriptome have come into widespread use: whole-genome tiling arrays and "next generation" short read sequencing technologies. How best to use the data provided by these methods for finding protein-coding genes, however, is unclear. This dissertation explores one way of combining traditional computational gene-finding algorithms with these new data sources to produce better predictions of gene structures. The Conditional Random Field (CRF) based gene-finding program CONTRAST (Gross 2007) was modified to incorporate data from tiling arrays and short read sequencing. Several experiments involving these modifications are presented, using both simulated and real data. The existence of real tiling array and short read sequence data from the same D. melanogaster cell line (Kc167) provides an opportunity to make a real-world comparison of the usefulness of these data sources for finding genes.
Keywords/Search Tags:Data, Finding, Gene, Genome sequence