Protein Function Prediction Based On Multi-source Data Integration

Posted on: 2022-06-29 | Degree: Master | Type: Thesis
Country: China | Candidate: L X Wang
GTID: 2480306569997529 | Subject: Computer technology
Abstract/Summary:
In recent years, the rapid development of life science and information technology has advanced basic research on genes and proteins. Proteins are involved in all aspects of biological activity, and proteomics research plays a key role in revealing the code of life; it is at the core of current bioinformatics and will be a key support for precision medicine in the future. The Gene Ontology project unifies the description of protein functions across species: each function is described by a term, and terms are linked by parent-child relationships. In today's information era, new data are produced every moment, including large amounts of protein sequence data and protein interaction data. However, the functions of most of these proteins are unknown, so methods that measure or compute these functions will greatly promote the development of proteomics and of drug research and development.

Determining protein function through biological experiments is costly and requires professional biological knowledge, which has led to the emergence of automatic protein function prediction strategies based on deep learning. Existing deep learning methods for annotating proteins usually consider only the protein sequence and ignore the constraining effect of the protein-protein interaction network on protein function, or vice versa. In addition, most network embedding algorithms are not suitable for protein-protein interaction networks. Moreover, some sequence models represent the amino acid sequence only with one-hot encoding and extract sequence features with a shallow neural network, which limits their ability to capture the complex nature of the sequence.

In view of these limitations, this dissertation makes the following contributions. A protein-protein interaction network is constructed, and a new network representation learning algorithm is proposed to obtain distributed representations of proteins. Representation vectors of amino acid substrings are generated with a method similar to that used to obtain word representations, and from these the matrix representation of a protein is obtained. A deep sequence model is constructed based on biological knowledge to extract the key features of the protein sequence automatically. Finally, this sequence feature is concatenated with the representation learned from the protein-protein interaction network to form the overall feature for function prediction. Experimental results on human and mouse datasets show that the proposed method achieves state-of-the-art performance in terms of AUC and other metrics.

In summary, this dissertation integrates protein sequence data and protein-protein interaction network data to annotate proteins. To obtain the embedding representation of the network, we start from the characteristics of the protein-protein interaction network and improve an existing embedding algorithm so that similar proteins have similar representations. Biological knowledge is also used to guide the deep learning model that extracts protein sequence features. This kind of domain-knowledge-guided deep learning model is also relevant to other fields.
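
The data flow described above (amino acid substrings mapped to vectors, stacked into a protein matrix, reduced to a sequence feature, then concatenated with a network embedding) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the dissertation's implementation: the substring length `k=3`, the random lookup table (standing in for learned substring vectors), the mean pooling (standing in for the deep sequence model), the example sequence, and all function names are hypothetical.

```python
import random


def kmers(sequence, k=3):
    """Split a protein sequence into overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_lookup(vocab, dim, seed=0):
    """Assumption: random vectors stand in for learned substring embeddings."""
    rng = random.Random(seed)
    return {kmer: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for kmer in sorted(vocab)}


def sequence_matrix(sequence, lookup, k=3):
    """Stack one vector per substring: the matrix representation of a protein."""
    return [lookup[kmer] for kmer in kmers(sequence, k)]


def mean_pool(matrix):
    """Assumption: column-wise mean replaces the deep sequence model."""
    rows = len(matrix)
    return [sum(column) / rows for column in zip(*matrix)]


def integrate(seq_feature, net_embedding):
    """Concatenate the sequence feature with the network embedding."""
    return list(seq_feature) + list(net_embedding)


sequence = "MKTAYIAK"                      # hypothetical toy sequence
tokens = kmers(sequence, k=3)
lookup = build_lookup(set(tokens), dim=4)
matrix = sequence_matrix(sequence, lookup, k=3)
seq_feature = mean_pool(matrix)
net_embedding = [0.1, -0.2, 0.3]           # hypothetical network embedding
feature = integrate(seq_feature, net_embedding)
```

In this sketch the combined `feature` vector would be the input to a classifier that scores Gene Ontology terms; the classifier itself is omitted because the abstract does not specify it.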
Keywords/Search Tags: protein function prediction, gene ontology, deep learning, network representation learning