Predicting protein-protein interactions and their interacting interfaces with statistical learning techniques

Posted on: 2013-04-26
Degree: Ph.D.
Type: Dissertation
University: University of Delaware
Candidate: Gonzalez, Alvaro J
Full Text: PDF
GTID: 1450390008483689
Subject: Biology
Abstract/Summary:
Anywhere you look in the cell, you will see proteins at work. Proteins are molecular machines built in a multitude of shapes and sizes, and they execute nearly all of the cell's functions. Typically, proteins do not carry out their functions as isolated entities. They bind other proteins (and other molecules in general) to create chemical factories with a definite spatial structure. With advances in genome sequencing, scientists have made good strides in identifying the proteins that may be found in a specific organism; that is, a partial parts list of the organism's cellular components is now known. However, it is still far from clear how these proteins interact with one another to form the machineries that ultimately perform living functions in the cell.

High-throughput experimental methods, such as yeast two-hybrid (Y2H) and mass spectrometry (MS), have been developed to screen a large number of proteins in a cell and assess their potential interactions. Yet these experimental methods have major drawbacks, including low interaction coverage, experimental biases toward certain protein types and cellular localizations, and high cost in both money and time. These problems have motivated the development of computational methods as alternative and supplemental approaches to predicting protein-protein interactions (PPI).

In this dissertation I propose new computational methods to address protein-protein interaction prediction. Specifically, given two proteins, this work tackles the following questions: (i) do they interact; (ii) if they interact, where are the interacting residues; and (iii) how are these interacting residues paired up, that is, what is the contact matrix?

First, I developed a PPI predictor based on hidden Markov models (HMM) and support vector machines (SVM) to predict whether two proteins interact. The method builds models of known interacting interfaces based on domain-domain interaction (DDI) families, i.e. groups of protein pairs that bind through the same pair of domains. Each interacting domain family is modeled with an HMM that differentiates interacting residues from non-interacting residues. The proposed algorithm is a two-stage pipeline that combines the flexibility of a generative learning model in the first stage (the domains' HMMs) with the discriminative power of a classifier (an SVM) in the second stage. The two stages are connected by a feature selection mechanism: singular value decomposition is applied to the attributes extracted from the domain HMMs, as measured by the Fisher score. Once trained, the model can predict whether two new proteins interact. The method significantly outperformed a previously proposed technique that uses the same input data.

Second, I tackled the problem of predicting the binding/functional sites in protein-ligand and enzyme-substrate interactions. These functional residues correspond to the binding surface through which a protein connects to a chemical, or to the active (recognition) site in an enzyme. In the context of a family of related proteins for which a quantitative measure of the functional relationships among member proteins is available, I developed statistical learning methods that predict the binding/functional residues by finding those positions in the family's multiple sequence alignment with the highest correlation to the functional codification of the family. The methods use canonical correlation analysis (CCA), kernel CCA (kCCA) and multi-positional kCCA to incorporate nonlinear correlations between residues and to analyze clusters of residues as a whole. When tested on benchmark datasets, the proposed methods significantly outperformed known algorithms that treat residues individually and independently.

Third, I proposed a method to further predict how the residues are paired up across the interface, namely the contact matrix, whose rows and columns correspond to the residues in the two interacting domains and whose values (1 or 0) indicate whether the corresponding residues do or do not interact. The method is built on the platform developed in the first part, the PPI predictor. The Fisher scores, which there represent whole domains as modeled by HMMs, are reformulated to represent individual residues. Each element of the contact matrix for a sequence pair is then represented by a feature vector formed by concatenating the vectors of the two corresponding residues, and the task is to predict the element value (1 or 0) from the feature vector. The sequence pairs in a DDI family are split into two sets, one for training and one for testing. A support vector machine is trained for a given DDI family, using either a consensus contact matrix or the contact matrices of individual sequence pairs. The method significantly outperformed a previous method based on multiple sequence alignment. The proposed algorithm is capable of extracting characteristic features while untying the residues from the rigid multiple sequence alignments used in previous methods. This makes it possible to handle residues corresponding to delete and insert states, and allows supervised learning on individual contact points, eliminating the need for a consensus contact matrix for the domain families, which has been a major source of false predictions. While designed for predicting contact points between interacting protein domains, the method may be useful as a module in protein folding and docking.
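The feature construction described for the contact matrix can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `pair_features` is hypothetical, and the per-residue vectors below are made-up numbers standing in for the residue-level Fisher scores that the method derives from the domain HMMs. The sketch only shows how one labeled example per contact-matrix element is formed by concatenating the vectors of the two residues involved; a classifier such as an SVM would then be trained on these examples.

```python
from itertools import product


def pair_features(res_a, res_b, contacts):
    """Build one (feature_vector, label) example per contact-matrix element.

    res_a, res_b: lists of per-residue feature vectors for the two domains
                  (placeholders for residue-level Fisher scores).
    contacts:     contact matrix; contacts[i][j] is 1 if residue i of the
                  first domain interacts with residue j of the second, else 0.
    """
    if len(contacts) != len(res_a) or any(len(row) != len(res_b) for row in contacts):
        raise ValueError("contact matrix must have shape len(res_a) x len(res_b)")

    examples = []
    for i, j in product(range(len(res_a)), range(len(res_b))):
        # Concatenate the two residues' vectors into one pair-level vector.
        features = list(res_a[i]) + list(res_b[j])
        examples.append((features, contacts[i][j]))
    return examples


# Toy data (assumed values, for illustration only).
domain_a = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]   # 3 residues, 2 features each
domain_b = [[0.5, 0.5], [0.8, 0.1]]               # 2 residues, 2 features each
contact_matrix = [[1, 0],
                  [0, 0],
                  [0, 1]]

examples = pair_features(domain_a, domain_b, contact_matrix)
for features, label in examples:
    print(features, label)
```

Because every residue pair becomes its own training example, the labels can come from the contact matrix of an individual sequence pair rather than from a family-wide consensus matrix.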
Keywords/Search Tags: Interacting, Protein, Predict, Contact, Residues, Method, Domains