The Study Of The Method For Predicting Protein-protein Interactions

Posted on: 2010-01-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M G Shi
GTID: 1100360275955556
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the completion of the draft human genome sequence, the focus of genomics research has gradually shifted from structural genomics to functional genomics, and biomedicine has entered a new era, the post-genome era. An important task in this era is the study of proteomics. As a growing number of high-throughput experimental technologies have matured, a large amount of proteomic data has accumulated. The current problem is that the means and capacity for analyzing these data lag seriously behind, so that data obtained with a great deal of human labor and financial support fail to yield more biologically meaningful results. It is therefore vitally important to develop advanced, highly efficient information analysis and data mining tools that find internal links in large, complex sets of protein data and so reveal the relationship between protein function and protein interaction.

Protein-protein interaction is a hot and difficult topic in molecular biology. Proteins are the major carriers of life activities and executors of function, and in-depth study of their complex and diverse structures and functions, their interactions, and their dynamic changes helps to reveal the nature of life phenomena at the molecular, cellular, and organism levels. Protein-protein interaction is an important part of diverse life activities in organisms, the basis of biochemical reactions in organisms, and a main research task of the post-genome era. While experimental methods provide large amounts of data, they also introduce large numbers of false positives and false negatives. This thesis therefore studies protein-protein interaction from a computational perspective, chiefly by exploring the application of machine learning methods to protein-protein interaction prediction. The thesis mainly covers the following aspects:

1) A novel method for predicting protein-protein interactions based on amino acid evolutionary conservation is proposed. Under natural selection, the amino acid residues involved in the function of a given protein family are more conserved, and the interaction between proteins and their environment depends on these important residues. Starting from the protein sequence, a new correlation coefficient encoding of the amino acid sequence is presented. The encoding scheme takes into account long-range interactions within the sequence and the co-evolution relationship between sequences. For the learning samples, positive samples were taken from DIP, MIPS, and BIND, and negative samples were constructed in four different ways: (a) R-NEG, constructed by random selection; (b) IS-NEG, constructed from subcellular localization using proteins within the same range of subcellular structures; (c) BS-NEG, constructed from subcellular localization using proteins in different ranges of subcellular structures; (d) GO-NEG, constructed from Gene Ontology information using lower RSSBP and RSSCC values. Comparing the combination of MIPS Core and GO-NEG with the other 11 combinations shows that the former gives higher prediction accuracy, with the smallest statistically significant P value.
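The localization-based construction of negative samples can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's procedure: it reads BS-NEG as "pair proteins that share no subcellular compartment and are not known to interact", and the function name, the `localization` mapping, and the toy protein identifiers are all hypothetical.

```python
from itertools import combinations


def different_compartment_negatives(localization, positives):
    """Return candidate negative pairs (hypothetical helper).

    localization: dict mapping protein id -> set of compartment names.
    positives: iterable of (protein, protein) pairs known to interact.
    A pair is kept only if the two proteins share no compartment
    and the pair is not among the known positives.
    """
    known = {frozenset(pair) for pair in positives}
    negatives = []
    # Sorting makes the output order deterministic.
    for a, b in combinations(sorted(localization), 2):
        if localization[a] & localization[b]:
            continue  # same compartment: could plausibly interact
        if frozenset((a, b)) in known:
            continue  # already a positive sample
        negatives.append((a, b))
    return negatives


# Toy data, assumed for illustration only.
toy_localization = {
    "P1": {"nucleus"},
    "P2": {"nucleus"},
    "P3": {"mitochondrion"},
    "P4": {"membrane"},
}
toy_positives = [("P1", "P2"), ("P3", "P4")]
toy_negatives = different_compartment_negatives(toy_localization, toy_positives)
print(toy_negatives)
```

A random negative set in the style of R-NEG could be drawn from the same candidate list by sampling, at the cost of occasionally including pairs that do interact but have not yet been observed.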
Because the MIPS Core and GO-NEG combination gave the highest prediction accuracy, the two sets are referred to as the gold standard positive samples and gold standard negative samples, respectively. In addition, compared with the established auto-correlation encoding of amino acid residues, the correlation coefficient encoding scheme yields better prediction results. Cross-species prediction results show that the SVM model based on the correlation coefficient encoding scheme has better generalization ability.

2) An improved GPCA plus LDA model was constructed to predict protein-protein interactions; it effectively improves prediction accuracy for membrane protein-protein interactions. The basis vectors obtained by the Greedy KPCA (GPCA) algorithm are taken directly from the sample data, whereas those obtained by the KPCA algorithm are linear combinations of the sample data. Although the greedy variant of KPCA is sub-optimal, it greatly reduces computational complexity compared with traditional KPCA. For the single-celled eukaryote Saccharomyces cerevisiae (yeast), most integral membrane proteins cannot be verified experimentally. We used 21 structure and sequence features of membrane protein interactions to construct 56 positive samples and 150 negative samples. Experiments with the kernel-based GPCA plus LDA method yielded 300 high-reliability predicted protein-protein interactions involving 189 membrane proteins.
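The idea that a greedy method takes its basis vectors directly from the samples, rather than forming linear combinations of all of them as KPCA does, can be illustrated with a short sketch. It uses a generic greedy rule (repeatedly pick the sample with the largest residual in kernel feature space, as in pivoted incomplete Cholesky) and is not claimed to be the thesis's GPCA procedure. The RBF kernel, its `gamma` value, and all function names are assumptions made for the example.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two equal-length tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def greedy_basis(samples, n_basis, kernel=rbf_kernel, tol=1e-12):
    """Greedily pick sample indices to serve as basis vectors.

    At each step the sample with the largest squared residual norm
    (its distance in feature space from the span of the samples chosen
    so far) is added. Returns the chosen indices and final residuals.
    """
    n = len(samples)
    residual = [kernel(x, x) for x in samples]
    columns = []  # incomplete Cholesky columns, one per chosen sample
    chosen = []
    for _ in range(min(n_basis, n)):
        pivot = max(range(n), key=lambda i: residual[i])
        if residual[pivot] <= tol:
            break  # every sample is already represented
        scale = math.sqrt(residual[pivot])
        column = []
        for i in range(n):
            value = kernel(samples[i], samples[pivot])
            value -= sum(c[i] * c[pivot] for c in columns)
            column.append(value / scale)
        columns.append(column)
        chosen.append(pivot)
        residual = [max(r - c * c, 0.0) for r, c in zip(residual, column)]
    return chosen, residual


# Two well-separated pairs of duplicate points: two basis samples suffice.
points = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0), (5.0, 5.0)]
indices, leftover = greedy_basis(points, n_basis=4)
print(indices, leftover)
```

Each step only touches one kernel column, which is the source of the computational saving over a full eigendecomposition of the kernel matrix; the price is that the selected basis is not guaranteed to be optimal.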
The experimental results also show that, although the GPCA plus LDA method performs feature reduction and removes redundancy in the data, its results were only slightly better than those of the LDA method alone; that GPCA plus LDA, which addresses the loss of high-order information, gives better prediction results than KPCA plus LDA; and that the variance of the prediction accuracy of GPCA plus LDA is the smallest, meaning that its ten measured values differ least, so the method is more robust. Moreover, computation revealed that membrane protein interactions exhibit small-world and scale-free properties.

3) A novel Bayesian additive regression trees (BART) model was proposed to infer protein-protein interactions. BART is a recent ensemble approach: a non-parametric Bayesian regression method decomposes the model into a number of weak classifiers, which together form a classifier ensemble. A BART prediction model that incorporates the backfitting MCMC algorithm obtained better prediction accuracy for protein-protein interactions. In particular, compared with standard MCMC methods, incorporating the backfitting MCMC algorithm effectively avoids local minima. In addition, the BART model achieved better prediction results on an independent test set, which indicates that it has good generalization ability.
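The small-world and scale-free observation about membrane protein interactions rests on graph statistics of the interaction network. A minimal sketch of two of them, the degree distribution (a heavy tail is the usual sign of a scale-free network) and the average clustering coefficient (high clustering together with short path lengths is the usual sign of a small-world network), is given below. The edge list and function names are hypothetical, and path-length computation is omitted for brevity.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_distribution(graph):
    """Map each degree to the number of proteins having that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


def average_clustering(graph):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count pairs of neighbours that are themselves connected.
        links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


# Toy interaction list, assumed for illustration only.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_graph = build_graph(toy_edges)
print(degree_distribution(toy_graph))
print(average_clustering(toy_graph))
```

On a real predicted network, the degree counts would typically be plotted on log-log axes to judge whether they follow a power law.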
Keywords/Search Tags: Protein-protein interactions, Protein sequence, Correlation coefficient encoding, Support vector machine, Gold standard positives dataset, Gold standard negatives dataset, Membrane protein, Greedy kernel principal component analysis