Integrating Machine Learning Algorithms With Gene Ontology For Research On Protein Function Prediction

Posted on: 2024-04-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Zhu
GTID: 1520307331972699
Subject: Control Science and Engineering
Abstract/Summary:
Proteins are the material basis of life and carry out diverse functions in nearly all biological activities. Protein functions are mainly annotated using Gene Ontology (GO), which groups them into three aspects: molecular function, biological process, and cellular component. Accurate annotation of protein functions is critical for elucidating life processes and disease pathogenesis and for guiding drug design. Direct determination of protein function through biological experiments is the standard approach, but it is often time-consuming and expensive and cannot keep pace with research progress in the post-genome era. Efficient computational methods for protein function prediction are therefore urgently needed.

This dissertation focuses on applications of machine learning algorithms to protein function prediction. Because the functions of proteins are closely associated with their coding genes and ligands, we study protein function prediction from three views, namely the coding gene, the protein itself, and the ligand, and identify a critical issue for each: the similarity metric of gene expression, protein language modeling, and class imbalance. For these three issues, we design solutions and corresponding function prediction methods based on metric learning, deep learning, and class-imbalance learning, respectively. The main contributions are summarized as follows:

(1) A metric learning-based protein function prediction method, Triplet GO, is proposed from the view of the protein-coding gene by integrating multi-source information fusion with template-alignment theory. Triplet GO consists of four sub-pipelines, driven by gene expression data, genetic sequence, protein sequence, and the naïve probability of GO terms, respectively. For gene expression data, we propose a triplet network with gene expression similarity-based GO prediction method (TN-GESGP) as a sub-pipeline of Triplet GO. Traditional unsupervised methods fail to associate expression similarity with functional similarity. To overcome this defect, TN-GESGP uses a supervised triplet network to measure expression similarity and a triplet loss to strengthen the relationship between expression similarity and functional similarity. Computational experiments on benchmark datasets demonstrate that Triplet GO achieves function annotation accuracy significantly beyond current state-of-the-art approaches. The advantage is mainly attributed to the triplet network, which effectively associates expression similarity with functional similarity and thereby improves the quality of the selected functional templates. In addition, the four sub-pipelines draw on different information sources and provide complementary knowledge that improves the prediction performance of Triplet GO.

(2) A new deep learning-based function prediction method, the attention-based transformer with triplet network for GO prediction (ATGO), is proposed from the view of the protein itself. To relieve the restriction that the lack of annotated function data places on deep learning models, ATGO uses an unsupervised attention network to train a protein language model on tens of millions of sequences; the model simulates the evolution process of sequences and extracts the corresponding feature representations for function prediction. To explore the complementarity between deep learning-based and template-based methods, we further implement a composite version, ATGO+, by combining ATGO with the protein sequence alignment-based GO prediction method (PSAGP). Experimental results on benchmark datasets demonstrate a significant advantage of the proposed methods over the current state of the art in accurate function prediction. This advantage mainly stems from the protein language model, which encodes discriminative feature representations by learning the abundant evolution-to-function knowledge contained in enormous numbers of sequences. In addition, the triplet network strengthens the relationship between feature similarity and functional similarity, and ATGO and PSAGP provide complementary information for further performance improvement.

(3) Protein-ligand binding site prediction is studied using machine learning algorithms from the view of ligand binding, which is an important term in the molecular function aspect of GO. To address the class-imbalance issue in this field, we develop ensembled hyperplane-distance-based support vector machines (E-HDSVM). Unlike traditional class-imbalance algorithms, the under-sampling in E-HDSVM is driven by the characteristics of the support vector machine (SVM), which relieves the negative impact of information loss or redundancy and improves the overall performance of SVM on class-imbalanced datasets. For an important ligand, the DNA molecule, a protein-DNA binding site predictor, DNAPred, is implemented by integrating E-HDSVM with sequence-based feature representations. Experimental results on benchmark datasets demonstrate that DNAPred achieves significantly better performance than existing DNA-binding site predictors, mainly because E-HDSVM effectively handles the class-imbalance issue.
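
The abstract does not give the exact form of the triplet loss used in TN-GESGP and ATGO. As a rough illustration only, the sketch below shows the standard margin-based triplet loss with Euclidean distance, in which an anchor embedding is pulled toward a functionally similar "positive" and pushed away from a functionally dissimilar "negative". The function names, the margin value, and the toy vectors are assumptions made for this example and are not taken from the dissertation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (an assumed form, not the author's code).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise it grows linearly with the violation.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (hypothetical values chosen so the distances are 1 and 5).
anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

# Positive is near and negative is far: the constraint is satisfied.
satisfied = triplet_loss(anchor, near, far)
# Positive is far and negative is near: the constraint is violated.
violated = triplet_loss(anchor, far, near)
```

In a setting like the one described, the anchor and positive would be embeddings of genes or proteins that share GO annotations and the negative one that does not, so minimizing the loss ties embedding similarity to functional similarity.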
Keywords/Search Tags: protein function prediction, gene ontology, protein-coding gene, machine learning, metric learning