Research on Protein Function Prediction Based on the Gene Ontology Structure

Posted on: 2019-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Fu
GTID: 2370330566480047
Subject: Computer software and theory
Abstract/Summary:
Proteins lay the foundations for many life processes, and a variety of important functions within the body require their participation. Integrating the features and functional information in large volumes of proteomic and genomic data, and automatically annotating proteins with accurate functions, can significantly advance the analysis and regulation of disease mechanisms, new drug research and development, crop production, bioenergy development and so on. Recently, many machine learning based algorithms for protein function prediction have received extensive attention and performed well. However, current methods often focus on predicting functions of completely unlabeled proteins and model protein function prediction as a general multi-label learning problem, ignoring the incompleteness and imbalance in functional annotations caused by experimental bias and biological interests. These methods also neglect the hierarchical relationships between functional labels, and hence achieve low accuracy. Multiple heterogeneous proteomic data sources can be converted to functional association networks to overcome heterogeneity. Some approaches integrate these data sources to improve protein function prediction, and they often get better results than methods that use a single data source. But how to integrate these data efficiently remains a difficulty for such methods. Furthermore, sufficient positive and negative samples help to train better classifiers and improve accuracy. Because negative examples for function labels are absent, most current methods either use only the known positive labels or select negative labels heuristically, and pay little attention to identifying negative labels, which also limits accuracy. To address these problems, we start from the hierarchical structure of the Gene Ontology (GO) with the aim of improving prediction accuracy, and take the construction and solution of machine learning models as basic tools. We conduct extensive research on novel protein function prediction, negative function prediction and protein function prediction based on multi-source data integration, and propose several effective algorithms. In summary, the key contributions of the thesis are:

(1) To address the incomplete functional annotations of proteins, we propose a method called dHG. dHG predicts novel functions by performing a random walk with restart on a directed hybrid graph, which consists of the Gene Ontology hierarchical structure and a protein-protein interaction network. dHG takes into account both the incompleteness of known protein function annotations and the hierarchical relations between functions. Experimental analysis shows that dHG can predict new functions not only for partially labeled proteins but also for completely unlabeled proteins. Considering the structural differences within the hybrid graph, we further introduce a method called NewGOA. NewGOA also considers the influence of noise in protein-protein interactions. It applies a bi-random walk algorithm, which executes asynchronous random walks on the hybrid graph, to predict new GO annotations of proteins. NewGOA inherits all the advantages of dHG. An experimental study on archived GO annotations of two model species (H. sapiens and S. cerevisiae) shows that NewGOA predicts new annotations of proteins more accurately and efficiently than other related methods.

(2) To identify negative annotations of proteins, we present a novel approach (called NegGOA) to select negative examples. Specifically, NegGOA takes advantage of the ontology structure, the available annotations and the potentially novel annotations of a protein to choose negative examples for that protein. An experimental study demonstrates that NegGOA suffers less from incomplete annotations and that negative example prediction improves the accuracy of protein function prediction. To take advantage of the feature information of proteins and the few available negative examples, we introduce a protein function prediction approach using positive and negative examples (ProPN). ProPN employs label propagation on a directed signed hybrid graph, which encodes positive examples, negative examples, interactions between proteins and correlations between functions, to predict protein function. The experimental results show that ProPN not only predicts negative examples for partially annotated proteins better than state-of-the-art algorithms, but also performs better than other related approaches in predicting functions for completely unlabeled proteins. Considering the difficulty of dealing with a huge collection of functions and with noise in the protein-protein interaction network, we propose an approach for predicting irrelevant functions of proteins based on dimensionality reduction (IFDR). IFDR first applies random walks separately on the adjacency matrix of the protein-protein interaction network and on the protein-function association matrix, to explore the underlying relationships between proteins and to model their missing functional annotations. Next, IFDR uses singular value decomposition to project these two matrices into two respective low-dimensional numerical matrices. After that, IFDR uses semi-supervised regression to predict negative examples of proteins. Experimental results on S. cerevisiae, H. sapiens and A. thaliana show that IFDR predicts negative examples more accurately than other related competitive methods. Dimensionality reduction in both the network space and the label space improves the accuracy of negative example prediction.

(3) To integrate multiple data sources for protein function prediction, we propose a method, called SimNet, to semantically integrate multiple functional association networks derived from heterogeneous data sources. SimNet first uses the GO annotations of proteins to capture the semantic similarity between proteins and introduces a semantic kernel based on this similarity. Next, it constructs a composite network as a weighted sum of the individual networks and aligns the composite network with the kernel to obtain the weights assigned to the individual networks. Then, it applies a network-based classifier on the composite network to predict protein function. Experiments show that SimNet integrates multiple networks more effectively and takes less time than other related methods. To integrate multiple networks differentially and handle more functions, we propose a protein function prediction approach based on multiple-network collaborative matrix factorization (ProCMF). To explore the latent relationships between proteins and between labels, ProCMF first applies nonnegative matrix factorization to factorize the protein-function association matrix into two low-rank matrices, and defines two smoothness terms on these matrices to guide the collaborative factorization with proteomic data. To integrate the networks differentially, ProCMF assigns different weights to them. Finally, ProCMF combines these goals into a unified objective function and introduces an alternating optimization technique to jointly optimize the low-rank matrices and the weights. Experimental results on three model species (Yeast, Human and Mouse) with multiple functional networks show that ProCMF outperforms other related competitive methods. ProCMF can effectively and efficiently handle massive label sets and differentially integrate multiple networks.
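
As an illustration of the random walk with restart that contribution (1) builds on, the following is a minimal sketch on a toy hybrid graph, not the thesis's dHG implementation. The node names, edge directions, unit edge weights, the restart probability of 0.5 and the handling of nodes without outgoing edges are all assumptions made for the example; the abstract does not specify them.

```python
def random_walk_with_restart(adj, seed, restart=0.5, tol=1e-12, max_iter=1000):
    """Stationary visiting probabilities of a walker that starts at `seed`
    and, at every step, jumps back to `seed` with probability `restart`.

    adj[i][j] is the non-negative weight of the directed edge i -> j.
    """
    n = len(adj)

    # Row-normalise edge weights into transition probabilities.
    trans = []
    for row in adj:
        total = sum(row)
        if total > 0:
            trans.append([w / total for w in row])
        else:
            # Assumption: a node with no outgoing edge returns the walker to the seed.
            trans.append([1.0 if j == seed else 0.0 for j in range(n)])

    p = [1.0 if j == seed else 0.0 for j in range(n)]
    for _ in range(max_iter):
        nxt = [0.0] * n
        for i in range(n):
            if p[i] == 0.0:
                continue
            for j in range(n):
                nxt[j] += (1.0 - restart) * p[i] * trans[i][j]
        nxt[seed] += restart  # restart mass always goes back to the seed
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy hybrid graph: three proteins and two GO terms.
nodes = ["P0", "P1", "P2", "GO_child", "GO_parent"]
adj = [
    # P0 P1 P2 GO_child GO_parent
    [0, 1, 0, 0, 0],  # P0 interacts with P1
    [1, 0, 1, 1, 0],  # P1 interacts with P0 and P2, annotated with GO_child
    [0, 1, 0, 0, 1],  # P2 interacts with P1, annotated with GO_parent
    [0, 0, 0, 0, 1],  # GO_child -> GO_parent (hierarchy edge)
    [0, 0, 0, 0, 0],  # GO_parent has no outgoing edge
]

scores = random_walk_with_restart(adj, seed=0)
# Scores of the GO-term nodes rank candidate new annotations for protein P0.
go_scores = {nodes[i]: scores[i] for i in (3, 4)}
```

In this sketch the unannotated protein P0 receives non-zero scores for both GO terms purely through its interaction partners and the term hierarchy, which is the intuition behind predicting functions for partially labeled or unlabeled proteins on a hybrid graph.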
Keywords/Search Tags: Protein function prediction, Gene Ontology, Machine learning, Negative samples, Data fusion