Two perspectives on biological sequence analysis

Posted on:2007-06-01

Degree:Ph.D

Type:Thesis

University:The University of Chicago

Candidate:Liu, Jing

GTID:2443390005977995

Subject:Computer Science

Abstract/Summary:

In my thesis, a pair of reading glasses with two different lenses (filters), physical chemistry and computational linguistics, is used to "read" the large warehouse of biological sequence data.

We examine the PDBSELECT dataset, a non-redundant subset of the PDB, and analyze the 3-D structure of the protein sequences through the physical chemistry lens. The role of pair-wise interactions between neighboring amino acids in protein sequences is examined by studying residue pairs whose sidechains are closely aligned, in the sense that their initial (CA-CB) segments are nearly parallel. This small but significant fraction of residue pairs tends to be highly polar in composition and includes many like-charged pairs. In addition, residue pairs with such closely aligned sidechains appear overwhelmingly in loops or at boundaries between different secondary structures. We examine the conformations of two different like-charged pairs in detail and show that each pair displays similar characteristic structural correlations, which differ from those found for the same pairs when their sidechains are not closely aligned.

Biological sequences are then viewed as an extension of natural language and analyzed through the computational linguistics lens. We describe Baum-Welch and Viterbi training algorithms that automatically extract context features of biological sequences, using unsupervised language acquisition techniques developed in computational linguistics. The key new element we borrow is the concept of a lexicon: a set of words of a language together with a probability measure over that set. When tested on an English corpus, our Viterbi training algorithm shows performance comparable to, and efficiency higher than, the Baum-Welch algorithm. We also show that our Viterbi training algorithm remains efficient when applied to large biological sequence databases and to recognizing eukaryotic promoter regions.

Last, we review techniques for defining metrics on discrete or continuous spaces of measures and apply the dual Lipschitz/Wasserstein metrics to measure distances between lexicons.
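
To make the lexicon idea concrete, the following is a minimal sketch of how a Viterbi-style training step might look when a lexicon is treated as a plain mapping from words to probabilities. It is not taken from the thesis: the function names `viterbi_segment` and `reestimate`, the toy DNA lexicon, and the simple unigram model are all assumptions made for illustration.

```python
import math
from collections import Counter


def viterbi_segment(seq, lexicon, max_len=None):
    """Return the most probable split of seq into lexicon words.

    lexicon maps word -> probability (hypothetical unigram model).
    Returns (list_of_words, log_probability).
    """
    if max_len is None:
        max_len = max(len(w) for w in lexicon)
    n = len(seq)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of seq[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            p = lexicon.get(seq[j:i])
            if p and best[j] > -math.inf:
                score = best[j] + math.log(p)
                if score > best[i]:
                    best[i] = score
                    back[i] = j
    if best[n] == -math.inf:
        raise ValueError("sequence cannot be segmented with this lexicon")
    # Walk the back-pointers to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(seq[back[i]:i])
        i = back[i]
    return words[::-1], best[n]


def reestimate(sequences, lexicon):
    """One hard-assignment training step (assumed form, for illustration).

    Counts words in the single best segmentation of each sequence and
    renormalizes. Words that are never used drop out of the lexicon.
    """
    counts = Counter()
    for s in sequences:
        words, _ = viterbi_segment(s, lexicon)
        counts.update(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


# Toy lexicon: four single bases plus one multi-letter "word".
toy_lexicon = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "TATA": 0.2}
words, logp = viterbi_segment("TATAGC", toy_lexicon)
new_lexicon = reestimate(["TATAGC", "TATA"], toy_lexicon)
```

The sketch uses only the single best segmentation of each sequence when recounting, which is the usual sense of Viterbi (hard) training. A Baum-Welch-style step would instead weight every possible segmentation by its posterior probability, which typically costs more per iteration.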

Keywords/Search Tags:

Biological, Computational linguistics, Different

Related items

1	Computational and functional analysis of growth hormone-regulated genes
2	Computational analysis on genomic variation: Detecting and characterizing structural variants in the human genome
3	Computational Identification Of Micrornas In Apple Expressed Sequence Tags And Validation Of Their Precise Sequences By Mir-Race
4	Integrated computational and experimental analysis of host-virus interaction systems
5	Computational and Genetic Screens for Regulators of Oxidative Phosphorylation
6	Applications Of Computational Fluid Dynamics In Mechanically Ventilated Multispan Plastic Greenhouse Research In North China
7	Computational analysis of human genomic sequence variation and Drosophila small RNA transcriptome
8	Structure-based Computational Retargeting of RNA Binding Proteins
9	Computational mapping and in vitro/in vivo evaluation of immunogenic epitopes for Lassa fever virus
10	Computational and experimental analysis of TAL effector-DNA binding