
Two perspectives on biological sequence analysis

Posted on: 2007-06-01
Degree: Ph.D.
Type: Thesis
University: The University of Chicago
Candidate: Liu, Jing
Full Text: PDF
GTID: 2443390005977995
Subject: Computer Science
Abstract/Summary:
In my thesis, a pair of reading glasses with two different lenses (filters), physical chemistry and computational linguistics, is used to "read" the large warehouse of biological sequence data.

With the physical chemistry lens, we examine the PDBSELECT dataset, a non-redundant subset of the PDB, and analyze the 3-D structure of its protein sequences. The role of pairwise interactions between neighboring amino acids in protein sequences is examined by studying residue pairs whose sidechains are closely aligned, in the sense that their initial (CA-CB) segments are nearly parallel. This small but significant fraction of residue pairs tends to be highly polar in composition and includes many like-charged pairs. In addition, residue pairs with such closely aligned sidechains appear overwhelmingly in loops or at boundaries between different secondary structures. We examine the conformations of two different like-charged pairs in detail and show that each pair displays similar characteristic structural correlations, which differ from those found for the same pairs when their sidechains are not closely aligned.

With the computational linguistics lens, biological sequences are viewed as an extension of natural language. We describe Baum-Welch and Viterbi training algorithms that automatically extract context features of biological sequences, using unsupervised language acquisition techniques developed in computational linguistics. The key new element we borrow is the concept of a lexicon: a set of words of a language together with a probability measure over that set. When tested on an English corpus, our Viterbi training algorithm shows performance comparable to, and efficiency higher than, the Baum-Welch algorithm. We also show that our Viterbi training algorithm remains efficient when applied to large biological sequence databases and to recognizing eukaryotic promoter regions.
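As a rough illustration of the lexicon idea, the sketch below finds the most probable split of a sequence into lexicon words by dynamic programming, then re-estimates word probabilities from those hard segmentations. It is a minimal sketch, not the thesis's implementation: the function names `viterbi_segment` and `viterbi_training_step`, the toy lexicon, and the absence of smoothing are all assumptions made for illustration.

```python
import math

def viterbi_segment(sequence, lexicon):
    """Return (words, log_prob) for the most probable split of `sequence`
    into words of `lexicon` (a dict mapping word -> probability)."""
    n = len(sequence)
    max_len = max(len(w) for w in lexicon)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of sequence[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = lexicon.get(sequence[start:end])
            if p is None or best[start] == -math.inf:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("sequence cannot be segmented with this lexicon")
    # Walk the back-pointers from the end to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(sequence[back[pos]:pos])
        pos = back[pos]
    words.reverse()
    return words, best[n]

def viterbi_training_step(sequences, lexicon):
    """One hard-assignment re-estimation step: segment every sequence with
    the current lexicon, then set each word's probability to its relative
    frequency. Words that are never used are dropped (no smoothing here,
    which is a simplifying assumption of this sketch)."""
    counts = {}
    for seq in sequences:
        words, _ = viterbi_segment(seq, lexicon)
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical toy lexicon over DNA letters plus one longer "word".
toy_lexicon = {"A": 0.2, "T": 0.2, "G": 0.1, "C": 0.1, "TATA": 0.4}
words, log_prob = viterbi_segment("GTATAC", toy_lexicon)
updated = viterbi_training_step(["GTATAC", "TATA"], toy_lexicon)
```

Baum-Welch training would instead weight every possible segmentation by its probability; taking only the single best segmentation keeps each pass cheaper, which is consistent with the efficiency comparison reported above.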
Finally, we review techniques for defining metrics on discrete or continuous spaces of measures, and apply the dual Lipschitz/Wasserstein metrics to measure distances between lexicons.
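For orientation, the standard Kantorovich-Rubinstein duality relates the two metrics named above. Here \mu and \nu stand for two lexicons' probability measures and d for a ground metric on words; these symbols are assumptions for illustration, and the abstract does not say which ground metric or exact formulation the thesis adopts.

```latex
W_1(\mu, \nu)
  = \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y) \, \mathrm{d}\gamma(x, y)
  = \sup_{\|f\|_{\mathrm{Lip}} \le 1}
    \left( \int f \, \mathrm{d}\mu - \int f \, \mathrm{d}\nu \right)
```

Here \Gamma(\mu, \nu) is the set of couplings with marginals \mu and \nu, and the supremum runs over functions that are 1-Lipschitz with respect to d.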
Keywords/Search Tags:Biological, Computational linguistics, Different