Research On Symmetric Prediction Of Protein-Protein Interactions Based On Pairwise Kernels

Posted on: 2012-11-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J T Yu
GTID: 1110330362950138
Subject: Artificial Intelligence and information processing
Abstract/Summary:
Proteins are directly involved in biological processes and often exert their functions through protein-protein interactions (PPIs). Constructing protein-protein interaction networks is therefore valuable for investigating molecular functions, discerning where groups of proteins may localize, and furthering our understanding of disease associations in order to identify drug targets. In silico methods for predicting protein-protein interactions have recently emerged as an important area of bioinformatics because they often overcome the drawbacks of wet-lab experiments, such as expense (in both time and money) and high false-positive rates. Of the available machine-learning approaches for predicting interaction data, kernel-based methods are popular due to their robustness and high performance. However, methods for maintaining the symmetry of predictions made by kernel functions, i.e. ensuring that 'A is predicted as interacting with B' is equivalent to 'B is predicted as interacting with A', have not been well studied, and the symmetry problem appears to directly affect the effectiveness and performance of these predictive models.

This thesis therefore focuses on how to retain the symmetry of protein-protein interactions by using pairwise kernels, which measure the similarity between pairs of proteins symmetrically. The biases that originate from traditional kernel-based predictors and training datasets are revealed, and methods for removing these biases are proposed. As an application of these methods, unbiased predictive models are created and used to predict a large number of protein-protein interactions in soybean for the first time.

More specifically, the thesis focuses on three main aspects.

Firstly, the prediction bias towards protein order that arises when traditional kernel-based methods are used is revealed.
The pairwise kernel is then introduced to fix the problem, and a new pairwise kernel is proposed that exploits properties already shown to be useful for predicting protein-protein interactions.

Protein-protein interactions are symmetric in character. However, when examples are formed by simply concatenating two proteins, with one protein forming the first half of the example and the other the second half, traditional kernel functions are of little use. They cannot 'split' an example into its two proteins and are sensitive to the order of the proteins, which leads to inconsistent predictions such as 'A interacts with B' whilst 'B does not interact with A'.

Pairwise kernels are adopted to remove the asymmetry that results from traditional kernels. Pairwise kernel functions regard proteins, rather than examples, as the minimal 'unit', and consider both the 'normal' and the 'reverse' order when measuring the similarity between two pairs of proteins. The necessity of pairwise kernels for symmetric prediction is underlined. Furthermore, the principles for constructing pairwise kernel functions, such as symmetry, positive (semi-)definiteness, and balance between variables, are summarized. Based on these principles, a novel pairwise kernel, AMPK (Arcsin Maximum Pairwise Kernel), is created, which performs on par with the current best pairwise kernel. A novel combination model of pairwise kernels, 'AMPK based on Cosine plus AMPK based on Laplace', is also proposed and has been shown to outperform current kernels and kernel-combination methods in predicting interactions of protein complexes.

Secondly, the performance of pairwise kernel-based classifiers is found to be artificially inflated when simple sequence features (three neighboring residues, 3mers) are used on traditional datasets, in which the negative datasets are constructed by 'simple random sampling'.
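To make the symmetry argument of the first aspect concrete, the following is a minimal sketch and not the thesis's AMPK, whose formula is not given in this abstract. It assumes a generic symmetrized product kernel built on a cosine base kernel, and compares it with a kernel applied to concatenated feature vectors. All function names and the toy feature vectors are hypothetical.

```python
import math

def cosine_kernel(x, y):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def concat_kernel(pair1, pair2, base=cosine_kernel):
    """Order-sensitive baseline: each pair is one concatenated vector."""
    (a, b), (c, d) = pair1, pair2
    return base(a + b, c + d)

def pairwise_kernel(pair1, pair2, base=cosine_kernel):
    """Symmetrized product kernel: sums the 'normal' and 'reverse' alignments."""
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)

# Hypothetical toy feature vectors for three proteins (assumption, not thesis data).
A = [1.0, 0.0, 2.0]
B = [0.0, 3.0, 1.0]
C = [2.0, 1.0, 0.0]

# Swapping the proteins inside a pair changes the concatenated kernel value...
naive_ab = concat_kernel((A, B), (A, C))
naive_ba = concat_kernel((B, A), (A, C))
# ...but leaves the pairwise kernel value unchanged.
sym_ab = pairwise_kernel((A, B), (A, C))
sym_ba = pairwise_kernel((B, A), (A, C))
```

Because the pairwise form sums over both alignments of the two proteins, any classifier built on it gives the same score to (A, B) and (B, A), which is the property the thesis requires of its kernels.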
The novel 'balanced random sampling' method is proposed to overcome this bias by constructing a rational negative dataset, on which an objective evaluation of classifier performance for unbiased prediction is obtained.

The traditional PPI positive dataset is shown to be a scale-free network, whereas the traditional PPI negative dataset is a random network. As a result, hub nodes, which are highly connected to other nodes in the positive dataset, appear less frequently in the traditional negative dataset. The difference between the number of times each protein appears in the positive and negative datasets results in prediction bias. When 3mers are used as sequence features, the bias becomes even more serious. In this case, pairwise kernels are prone to labeling examples that involve hub proteins as 'positives' and those that do not as 'negatives'. This kind of prediction is based purely on the number of times each protein appears in the dataset; it does not aid in making genuine predictions, but it can still make prediction performance appear artificially high.

In order to remove these biases, 'balanced random sampling' is proposed, which aims to create a rational negative dataset that is scale-free like the positive dataset. During balanced random sampling, each protein has an equal opportunity to appear in the positive or the negative dataset, and the bias towards the number of occurrences of each protein per dataset is therefore removed. Rational datasets form a basis for objective evaluation of the performance of pairwise kernel-based classifiers, and show that previous estimates of prediction performance using 3mer features were over-optimistic. However, complex sequence features, i.e. Pfam domains, prove to be less sensitive to the traditional datasets than 3mer features and contribute positively to the prediction of protein-protein interactions.

Thirdly, we use the newly sequenced Glycine max (soybean) genome to infer a large number of soybean protein-protein interactions for the first time. To make these inferences we use the conventional method of homologous protein-protein interactions (interologs) together with the kernel-based predictive model mentioned above, resulting in 10 426 high-confidence soybean protein-protein interactions.

Predicting soybean protein-protein interactions was one of the main tasks following the sequencing of the soybean genome. More than ten thousand soybean protein-protein interactions have been successfully predicted with our in silico method. Soybean interologs are first inferred from protein-protein interactions in homologous species and then filtered by pairwise kernel-based methods that use domains as the classifier feature. More specifically, the candidate dataset of soybean interactions is obtained by looking for soybean interologs of protein-protein interactions in Arabidopsis thaliana, Saccharomyces cerevisiae, and Homo sapiens. Domain-based pairwise kernel methods then act as unbiased predictive classifiers to filter the interologs, using a cross-species strategy: training on data from the source species (Arabidopsis, Saccharomyces, or Homo sapiens) and testing on data from soybean. This transfer of methods between species is proposed on the basis of conserved domain-domain interactions that are present in both 'source' and 'target' species. This is the first time that a large number of soybean PPIs have been predicted using computational methods, and prediction performance is assessed using cross-validation.
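The interolog step described above can be sketched roughly as follows. This is an illustrative outline only: the function name, the homolog mapping, and the protein identifiers are all hypothetical, and the subsequent filtering by a domain-based pairwise kernel classifier is not shown.

```python
from itertools import product

def infer_interologs(source_ppis, orthologs):
    """Map source-species interactions onto target-species candidate pairs.

    source_ppis: iterable of (protein_a, protein_b) in the source species.
    orthologs:   dict mapping a source protein to a list of target homologs.
    Returns a set of sorted tuples, so (x, y) and (y, x) are one candidate.
    """
    candidates = set()
    for a, b in source_ppis:
        # Every homolog of a paired with every homolog of b is a candidate.
        for x, y in product(orthologs.get(a, []), orthologs.get(b, [])):
            candidates.add(tuple(sorted((x, y))))
    return candidates

# Hypothetical identifiers, for illustration only.
source_ppis = [("AT1", "AT2"), ("AT2", "AT3"), ("AT2", "AT1")]
orthologs = {"AT1": ["Gm01"], "AT2": ["Gm02", "Gm03"]}

candidates = infer_interologs(source_ppis, orthologs)
```

Storing each candidate as a sorted tuple keeps the candidate set itself order-independent, in line with the symmetric treatment of protein pairs elsewhere in the thesis.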
The combination of homologous PPIs and domain-based pairwise kernels used in this thesis is concluded to be an effective approach for predicting protein-protein interactions in organisms whose genomes are newly sequenced. Finally, soybean protein complexes in the predicted protein-protein interaction network are revealed, and interactions between plant resistance genes/proteins within these complexes are investigated in order to infer related biological functions.
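As an illustration of the second aspect, the sketch below shows one possible reading of 'balanced random sampling'. The abstract does not give the exact procedure, so this is an assumption: negatives are drawn so that every protein occurs exactly as often as it does in the positive set. The function name and toy data are hypothetical. The restart-on-conflict strategy is chosen for brevity and would be too slow for a real interaction network.

```python
import random
from collections import Counter

def balanced_negative_sample(positives, seed=0, max_tries=10000):
    """Sample negative pairs so each protein occurs as often as in `positives`."""
    rng = random.Random(seed)
    positive_set = {frozenset(p) for p in positives}
    degree = Counter(protein for pair in positives for protein in pair)
    for _ in range(max_tries):
        # One 'stub' per occurrence of a protein in the positive set.
        stubs = [p for p, n in degree.items() for _ in range(n)]
        rng.shuffle(stubs)
        negatives = set()
        for a, b in zip(stubs[0::2], stubs[1::2]):
            pair = frozenset((a, b))
            if a == b or pair in positive_set or pair in negatives:
                break  # invalid pairing: reshuffle and start over
            negatives.add(pair)
        else:
            return [tuple(sorted(p)) for p in negatives]
    raise RuntimeError("no valid balanced negative set found")

# Hypothetical toy network: H is a hub with three partners.
positives = [("H", "A"), ("H", "B"), ("H", "C"),
             ("D", "E"), ("E", "F"), ("D", "F")]
negatives = balanced_negative_sample(positives)
```

Under this reading, the hub H appears three times in the negatives as well, so a classifier can no longer separate the classes simply by counting how often a protein occurs.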
Keywords/Search Tags: Protein-protein interaction, Pairwise kernel, Symmetric prediction, Prediction bias, Multi-kernel combination, Protein complex, Random sampling, Soybean