
Analysis of protein-protein interactions using multiple biological data sets

Posted on: 2007-10-16
Degree: Ph.D.
Type: Thesis
University: University of Southern California
Candidate: Lee, Hyunju
Full Text: PDF
GTID: 2450390005486422
Subject: Biology
Abstract/Summary:
With the availability of various large-scale biological data sets, it is crucial to develop systematic methods to unravel cellular activities at the protein level. In this thesis, we study protein functions, protein interactions, and proteins associated with phenotypes by integrating various biological data sets.

First, we develop a maximum likelihood estimation method that uses both protein localization and gene expression data to estimate the reliability of protein interaction data sets. By integrating both data sets, we obtain more accurate estimates of the reliability of various interaction data sets than is possible using a single data set.

Second, we use protein-protein interactions to understand protein functions. We develop a novel Kernel Logistic Regression (KLR) method based on diffusion kernels for protein interaction networks. We extend our model by incorporating multiple biological data sources. We show that the KLR approach significantly improves the accuracy of protein function predictions over the other models.

Third, we infer domain interactions using multiple biological data sources. We propose a new measure, the expected number of observed interactions for each pair of domains from protein interaction data. We score domain interactions based on protein interaction data from yeast, worm, fruit fly, and human. We also incorporate information on pairs of domains that coexist in known proteins and on pairs of domains with the same function to construct a high-confidence set of domain-domain interactions using a Bayesian approach. As a result, a total of 2,420 high-confidence domain interactions are obtained, and these domain interactions are used to unravel detailed protein and domain interactions in several protein complexes.

Fourth, we develop a new method for prioritizing genes associated with a phenotype by combining gene expression and protein interaction data (CGI) based on a Markov Random Field framework. The method has been applied to three yeast gene expression data sets (compendium knockout, stress response, and cell cycle), together with various protein interaction data sets. We show that the CGI outperforms the Pearson correlation coefficient in prioritizing genes associated with a phenotype.
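To make the notion of a diffusion kernel on a protein interaction network more concrete, the following is a minimal sketch, not the thesis's implementation. It assumes the standard graph diffusion kernel K = exp(tau * H), where H = A - D is the negative graph Laplacian of an undirected, unweighted network; the exact formulation used in the thesis is not given in the abstract. The toy network, the value of tau, and the function names are all hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adjacency, tau=1.0, squarings=10, terms=12):
    """Approximate K = exp(tau * H) with H = A - D (assumed kernel form).

    Uses a truncated Taylor series on a scaled-down matrix, followed by
    repeated squaring. The diagonal of `adjacency` is ignored.
    """
    n = len(adjacency)
    scale = tau / (2 ** squarings)
    # Build the scaled matrix (tau / 2^s) * H.
    h = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(adjacency[i][j] for j in range(n) if j != i)
        for j in range(n):
            h[i][j] = scale * (-degree if i == j else adjacency[i][j])
    # Taylor series: I + h + h^2/2! + ... up to `terms` terms.
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, h)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # Undo the scaling: exp(M) = (exp(M / 2^s))^(2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical toy network: proteins 0-1-2 form a chain, protein 3 is isolated.
toy_network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
toy_kernel = diffusion_kernel(toy_network, tau=1.0)
```

In this sketch, directly interacting proteins receive a larger kernel value than proteins connected only through an intermediate, and a protein with no interactions has zero similarity to all others. A kernel of this kind could then serve as the similarity measure in a kernel-based classifier.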
Keywords/Search Tags:Data, Interaction, Protein, Method, Develop