| Proteins are important components of cells and tissues in organisms, and they are produced by the translation of mature mRNAs (isoforms). Most life activities in the body involve proteins, so accurately predicting protein functions helps us better understand the principles of life activities, explore disease mechanisms and design new drugs. Existing protein function prediction studies are conducted at the gene level and in essence collectively predict the functions of genes. However, a gene can produce more than one protein variant after transcription and translation, so the functions of a gene do not map directly to each protein variant. Predicting the functions of isoforms has therefore become a new research direction in protein function prediction. However, the lack of isoform-level data and ground-truth annotations hinders the development of isoform function prediction. With the rapid development of high-throughput RNA-seq technology, a large amount of transcriptome sequence data has been obtained, which provides high-resolution data for distinguishing different isoforms. As a result, some isoform function prediction algorithms have been developed based on RNA-seq data. These algorithms leverage known gene functional annotations and gene-isoform associations to complete the function prediction task. However, this prediction paradigm ignores important gene-level data, such as gene interaction data and Gene Ontology (GO) data. In addition, existing prediction algorithms still have two problems to be solved: (i) they all assume that the known gene functional annotations are complete, although in practice they are incomplete; (ii) gene functional annotations are only extended to their isoforms, while the aggregation of functional annotations from isoforms to genes is not considered. To address the above problems and to improve the current isoform function prediction
accuracy, we start from an effective combination of gene interaction data and GO hierarchy data, and build on a multi-label multi-instance learning framework. We propose two algorithms for isoform function prediction. In summary, the key contributions of the thesis are: (1) To address the problems of only extending gene functional annotations to isoforms and ignoring gene interaction data, we propose IsoFun, an isoform function prediction algorithm based on bi-random walks on a heterogeneous network. IsoFun first constructs an isoform functional association network from the expression profiles of isoforms collected from multiple RNA-seq datasets, and assigns all the annotations of a gene to its isoforms. Next, it constructs a heterogeneous network composed of isoforms, genes and GO terms, which encodes the relationships between genes and isoforms, the hierarchical relationships between GO terms and the functional associations between isoforms. This heterogeneous network combines gene-level interactions, the available GO annotations of genes and the relationships between genes and isoforms, and thus reduces the impact of the incompleteness of any single data source. After that, IsoFun applies bi-random-walk-based label propagation on the constructed heterogeneous network to predict isoform functions. To ensure that the known function of a gene is inherited by an isoform of this gene, IsoFun clamps the known function to the most ‘responsible’ isoform in each iteration of the random walk. Experimental results on the human RNA-seq dataset show that IsoFun achieves much better prediction performance than existing isoform function prediction algorithms. By comparing IsoFun with its own variants, we further confirm the advantage of the dynamic bidirectional transfer of functional annotations, and the auxiliary role of gene-level interaction data and GO hierarchy data in isoform function prediction. In addition, IsoFun can effectively differentiate the isoform functions of two genes (ADAM15
and BCL2L1) with known isoform functions. (2) The known gene functional annotations are incomplete, and new annotations are added over time. However, existing isoform function prediction algorithms assume that the known gene functional annotations are complete. To solve this problem, we propose DisoFun, an isoform function prediction algorithm based on collaborative matrix factorization. DisoFun assumes that the functional annotations of genes are aggregated from those of key isoforms. First, DisoFun employs clustering analysis to identify k key isoforms and the correlations between the other isoforms and the key isoforms. Next, it uses the associations between isoforms and key isoforms to extend the functional annotations of the key isoforms to all isoforms, and then uses the associations between genes and isoforms to aggregate the functional annotations of isoforms to their originating genes. After that, we integrate the above objectives and maximize the consistency between the aggregated functional annotations and the known gene functional annotations. The gene functional annotations are pushed back to the key isoforms to coordinate the identification and function prediction of key isoforms. Given the importance of gene interaction data and the GO hierarchy in gene function prediction, and the incompleteness of gene functional annotations, DisoFun uses the gene interaction network and the GO hierarchy to construct two manifold regularization terms that guide the completion of gene functional annotations, the exploration of gene-key isoform correlations and the function prediction of key isoforms. The results show that DisoFun significantly improves on the accuracy of existing isoform function prediction methods. Integrating the gene interaction network and GO hierarchy data effectively completes the functional annotations of genes and key isoforms, and further improves the accuracy of isoform function prediction. In addition, we study several genes (LMNA, BCL2L1 and CFLAR) with
known isoform function annotations, and the experimental results show that DisoFun can accurately differentiate the isoform functions of these genes. |
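
The propagate-then-clamp idea described for IsoFun can be illustrated with a deliberately simplified sketch. It is not the thesis's algorithm: it runs a plain random walk with restart on the isoform network only, rather than a bi-random walk on the full heterogeneous network of isoforms, genes and GO terms. The function names, the restart weight `alpha`, the iteration count and the toy data are all assumptions made for illustration.

```python
import numpy as np

def row_normalize(W):
    """Turn a non-negative similarity matrix into a row-stochastic matrix."""
    sums = W.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0  # isolated nodes keep an all-zero row
    return W / sums

def propagate_with_clamping(W_iso, gene_of, Y_gene, alpha=0.5, n_iter=20):
    """Simplified label propagation over isoforms (hypothetical helper).

    W_iso   : (n_iso, n_iso) isoform association matrix
    gene_of : (n_iso,) index of the parent gene of each isoform
    Y_gene  : (n_gene, n_term) binary gene-level annotations
    """
    P = row_normalize(np.asarray(W_iso, dtype=float))
    # Initialisation: every isoform inherits all annotations of its gene.
    F0 = np.asarray(Y_gene, dtype=float)[gene_of]
    F = F0.copy()
    for _ in range(n_iter):
        # One random-walk-with-restart step on the isoform network.
        F = alpha * (P @ F) + (1.0 - alpha) * F0
        # Clamp each known gene function to the isoform that currently
        # scores highest for it (the most "responsible" isoform).
        for g in range(Y_gene.shape[0]):
            members = np.flatnonzero(gene_of == g)
            if members.size == 0:
                continue
            for t in np.flatnonzero(Y_gene[g]):
                best = members[np.argmax(F[members, t])]
                F[best, t] = 1.0
    return F

# Toy data (made up): gene 0 has isoforms 0 and 1, gene 1 has isoform 2.
W_iso = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
gene_of = np.array([0, 0, 1])
Y_gene = np.array([[1, 0],
                   [0, 1]])
F = propagate_with_clamping(W_iso, gene_of, Y_gene)
print(F)
```

Each row of `F` scores one isoform against the two toy terms. The clamping step guarantees that every annotated gene keeps at least one isoform with a full score for that annotation, while its other isoforms are free to drift according to the network.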