
Statistical techniques for biological motif discovery

Posted on: 2008-11-18
Degree: Ph.D.
Type: Dissertation
University: Cornell University
Candidate: Nagarajan, Niranjan
Full Text: PDF
GTID: 1445390005477947
Subject: Computer Science
Abstract/Summary:
In recent years, genome sequencing projects, together with computational and experimental efforts to find genes, have provided a wealth of sequence information in protein and DNA databases. A large portion of this sequence data, however, has yet to be characterized. Experimental efforts and manual curation have tried to keep up with the flood of data, but it has become increasingly clear that reliable computational methods are required to fill the gap. In addition to their value in furthering research in basic biology, improved computational tools for annotating proteomes and genomes serve as an important first step in realizing the biomedical promise of whole-cell modelling and systems biology.

In this dissertation we discuss statistical and algorithmic techniques for two important areas in the field of biological sequence analysis. We begin by discussing our work on improving a class of motif-finding tools that are widely used to discover regulatory signals in DNA. This work is based on new ideas in computational statistics that provide efficient and accurate tools for the analysis of motif significance. These tools make it feasible to incorporate a statistical score in motif-finding algorithms, and we show experimentally that this new approach can give rise to significantly more sensitive motif finders.

In the rest of this dissertation we discuss a new machine-learning-based approach for predicting conserved functional and structural units (or domains) in proteins. Finding domains in proteins is an important step in the classification and study of proteins and their role in interaction networks. Our proposed framework learns an expert definition of protein domains (to accurately model this concept) while avoiding the heuristic rules prevalent in earlier methods. Results from experiments on a large set of protein sequences validate the improved accuracy and coverage of our approach.
Keywords/Search Tags: Motif, Statistical, Computational