
Research On Heterogeneous Feature Information In Assessing Reliability Of Protein-Protein Interactions

Posted on: 2010-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y M Ou
GTID: 2178360275989234
Subject: Computer application technology
Abstract/Summary:
High-throughput experimental and computational methods are generating a wealth of protein-protein interaction (PPI) data for a variety of organisms. However, data produced by current state-of-the-art methods include many false positives, which can hinder the analyses needed to derive biological insights. One way to address this problem is to assign a confidence score to each interaction using computational techniques. The key steps in this work are feature selection and extraction, and algorithm design and implementation. This thesis investigates heterogeneous feature information for assessing the reliability of PPIs.

Part I. This part uses statistical methods to investigate, in yeast, the relationship between PPIs and gene expression profiles and the relationship between PPIs and subcellular localization. Four PPI example sets were constructed: a positive set, a negative set, a random negative set, and a co-complex set. For all protein pairs in the four sets, the distributions of gene co-expression based distances are compared, and for proteins with known subcellular localization, the co-localization frequencies of the pairs are compared. The results show that the gene expression profiles of interacting proteins are more similar than those of non-interacting pairs, and that interacting proteins tend to share the same subcellular localization.

Part II. Based on heterogeneous data resources and least squares support vector machine (LS-SVM) classifiers, this part presents a computational system for assessing the reliability of PPIs in yeast. The data resources cover six data types: amino acid sequences, domain-domain interactions, protein function annotation, gene expression profiles, subcellular localization, and pseudo-amino acid composition. Large-scale data preprocessing and feature extraction are implemented programmatically in MATLAB. The 8 400 protein pairs in the example set are encoded with these features, giving 125-dimensional attribute vectors. Using the combined heterogeneous features, a prediction model is trained and tested with LS-SVM. The model achieves an overall prediction accuracy of 76.37%, evaluated by 3-fold cross-validation. The work further compares the accuracy obtained with direct versus indirect features and with single versus combined features, and reveals implicit, intrinsic relationships among these high-throughput data.

In this work, a quantitative analysis was carried out on these high-throughput data, and new knowledge was inferred from the correlations among heterogeneous features. In this way, the work integrates biological data from different sources. It can provide broader and deeper information for research on cellular mechanisms, and can serve as a reference for related research on other species whose data are still incomplete.
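
The training and evaluation procedure summarized above (an LS-SVM classifier scored by 3-fold cross-validation) can be illustrated with a minimal sketch. It is not the thesis's MATLAB implementation: the RBF kernel, the hyperparameters `gamma` and `sigma`, the synthetic 5-dimensional data from `make_toy_data`, and all function names are assumptions made for illustration only. The sketch solves the standard LS-SVM classifier linear system directly with NumPy.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B (assumed kernel)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=3.0):
    """Solve the LS-SVM classifier system for bias b and multipliers alpha.

    [ 0   y^T             ] [ b     ]   [ 0 ]
    [ y   Omega + I/gamma ] [ alpha ] = [ 1 ]
    with Omega_ij = y_i * y_j * K(x_i, x_j) and labels y in {-1, +1}.
    """
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]

def lssvm_predict(X_train, y_train, b, alpha, X_new, sigma=3.0):
    """Classify new points by the sign of sum_i alpha_i * y_i * K(x, x_i) + b."""
    scores = rbf_kernel(X_new, X_train, sigma) @ (alpha * y_train) + b
    return np.where(scores >= 0, 1.0, -1.0)

def cross_val_accuracy(X, y, k=3, gamma=10.0, sigma=3.0, seed=0):
    """Overall accuracy from k-fold cross-validation (k=3 mirrors the abstract)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    correct = 0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        b, alpha = lssvm_fit(X[train], y[train], gamma, sigma)
        pred = lssvm_predict(X[train], y[train], b, alpha, X[test], sigma)
        correct += int((pred == y[test]).sum())
    return correct / len(y)

def make_toy_data(n_per_class=60, dim=5, seed=0):
    """Synthetic stand-in for encoded protein pairs (hypothetical, not thesis data)."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(1.0, 1.0, (n_per_class, dim))    # "interacting" pairs
    neg = rng.normal(-1.0, 1.0, (n_per_class, dim))   # "non-interacting" pairs
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    print("3-fold CV accuracy:", cross_val_accuracy(X, y, k=3))
```

Unlike a standard SVM, which requires solving a quadratic program, LS-SVM reduces training to a single linear solve, which is why the whole classifier fits in a few lines here. With real data, each row of `X` would be the feature vector of one protein pair and `y` its positive or negative label; `gamma` and `sigma` would need tuning.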
Keywords/Search Tags: Protein-Protein Interactions (PPI), Reliability Assessment, Least Squares Support Vector Machine (LS-SVM)