Homology identification for multidomain proteins

Posted on: 2008-09-23
Degree: Ph.D
Type: Thesis
University: Carnegie Mellon University
Candidate: Song, Nan
Full Text: PDF
GTID: 2440390005978125
Subject: Biology
Abstract/Summary:
Homology identification is the first step in many genome-scale computational analyses, including comparative mapping, phylogenetic footprinting, comparison of biological networks, genome annotation and analysis of whole genome duplication. Traditional homology identification methods based on sequence similarity fall short when applied to modular sequence families, which can have significant sequence similarity due to a shared domain despite having distinct evolutionary histories. Although additional criteria based on alignment length have been proposed to address this difficulty, this approach results in high error rates, as I demonstrate in this thesis. There have been two obstacles to developing better homology identification methods for modular sequences. First, there is no accepted model of homology for modular sequences. Second, benchmark datasets of known modular families are needed, but no suitable datasets are currently available.

In this thesis, I propose a formal model of modular sequence evolution. Using this model, I curated a benchmark dataset of mouse and human sequences drawn from twenty well-studied protein families. Using this dataset, I evaluated the performance of sequence similarity and alignment coverage in homology identification. Surprisingly, although these methods are widely used, they result in a large number of mis-assignments.

In response, I propose two new homology identification methods for modular sequences. Neighborhood Correlation is a novel method based on comparing sequence neighborhoods, where the neighborhood of a query sequence is the set of sequences with significant matches to it. In an empirical comparison with traditional sequence analysis approaches on twenty hand-curated sequence families, I demonstrate that Neighborhood Correlation is more accurate and reliable. In particular, Neighborhood Correlation achieves high sensitivity and high specificity in complex modular families as well as in simple families with a single domain.
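The idea of comparing sequence neighborhoods can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the choice below (Pearson correlation between two sequences' similarity-score profiles over a shared database) is an assumption made for illustration, not the implementation from the thesis. The function name `neighborhood_correlation`, the `floor` parameter, the sequence identifiers and the scores are all hypothetical.

```python
import math

def neighborhood_correlation(scores_x, scores_y, database, floor=0.0):
    """Pearson correlation between the similarity profiles of two sequences.

    scores_x, scores_y: dicts mapping a database sequence id to a similarity
    score (for example a bit score). Sequences without a significant match
    are absent and receive `floor` (an assumed default, not from the thesis).
    """
    xs = [scores_x.get(seq_id, floor) for seq_id in database]
    ys = [scores_y.get(seq_id, floor) for seq_id in database]
    n = len(database)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no neighborhood information
    return cov / math.sqrt(var_x * var_y)

# Toy data (made up): a and b match the same sequences strongly, while
# c overlaps with a only through a weak match to p3 (a shared domain).
database = ["p1", "p2", "p3", "p4", "p5", "p6"]
hits_a = {"p1": 250, "p2": 240, "p3": 30}
hits_b = {"p1": 245, "p2": 250}
hits_c = {"p3": 35, "p4": 300, "p5": 280}

nc_ab = neighborhood_correlation(hits_a, hits_b, database)
nc_ac = neighborhood_correlation(hits_a, hits_c, database)
nc_aa = neighborhood_correlation(hits_a, hits_a, database)
print(f"a vs b: {nc_ab:.3f}, a vs c: {nc_ac:.3f}")
```

In this toy example, the pair whose neighborhoods largely coincide scores much higher than the pair that merely shares one weak match, which is the behavior a neighborhood-based criterion is meant to capture.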
Furthermore, Neighborhood Correlation is easy to implement, yielding an efficient, high-throughput method for modular homology detection.

I also propose Domain Architecture Comparison to detect homology through explicit comparison of domain architecture. I developed several schemes for scoring the similarity of a pair of protein sequences by exploiting an analogy between comparing proteins using their domain architecture and comparing documents based on their word content. I evaluated the proposed methods using my benchmark dataset, demonstrating the effectiveness of comparing domain architecture to identify homology. My results also demonstrate the importance both of down-weighting promiscuous domains and of compensating for proteins with large numbers of domains.
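The document-comparison analogy can be made concrete with a small sketch. The abstract mentions several scoring schemes without specifying them, so what follows is one standard information-retrieval choice, used here as an assumption: inverse-document-frequency weights to down-weight promiscuous domains, and cosine normalization to compensate for proteins with many domains. The function names, protein identifiers and architectures are hypothetical toy data.

```python
import math
from collections import Counter

def idf_weights(architectures):
    """Weight each domain by log(N / number of proteins containing it).

    Domains found in many proteins (promiscuous domains) get low weights.
    """
    n = len(architectures)
    doc_freq = Counter()
    for domains in architectures.values():
        doc_freq.update(set(domains))  # count each protein once per domain
    return {d: math.log(n / count) for d, count in doc_freq.items()}

def architecture_similarity(domains_a, domains_b, idf):
    """Cosine similarity of IDF-weighted domain-count vectors.

    Dividing by the vector norms keeps proteins with many domains from
    scoring highly just because they contain more domains.
    """
    vec_a = {d: k * idf.get(d, 0.0) for d, k in Counter(domains_a).items()}
    vec_b = {d: k * idf.get(d, 0.0) for d, k in Counter(domains_b).items()}
    dot = sum(w * vec_b.get(d, 0.0) for d, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy architectures (made up): SH3 appears in most proteins, so it is
# treated as promiscuous and contributes little to the score.
architectures = {
    "protA": ["SH3", "SH2", "Pkinase"],
    "protB": ["SH2", "Pkinase"],
    "protC": ["SH3", "PDZ"],
    "protD": ["SH3", "C2", "C2"],
    "protE": ["SH3"],
}
idf = idf_weights(architectures)
sim_ab = architecture_similarity(architectures["protA"], architectures["protB"], idf)
sim_ac = architecture_similarity(architectures["protA"], architectures["protC"], idf)
print(f"protA vs protB: {sim_ab:.3f}, protA vs protC: {sim_ac:.3f}")
```

Here protA and protC share only the widespread SH3 domain and receive a low score, while protA and protB share two less common domains and score close to 1.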
Keywords/Search Tags: Homology identification, Domain, Comparison, Neighborhood correlation, Sequence, Modular