Research On Probabilistic Topic Model For Protein Function Prediction

Posted on: 2018-07-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Liu
GTID: 1360330518954989
Subject: Communication and Information System
Abstract/Summary:
Proteins are important and diverse macromolecules of living cells, and research on protein function is pivotal for developing new drugs, better crops, and synthetic biochemical compounds. Protein function prediction is essentially a multilabel classification problem, and current methods are mainly based on traditional discriminative classification models. However, factors such as the large number of protein functional annotation labels and the hierarchical correlation among them challenge these traditional classification methods. A topic model is a probabilistic generative model that originated in the field of text mining. Multilabel supervised topic models can not only mine hidden patterns but also perform multilabel classification of documents through supervised learning. Introducing multilabel supervised topic models into protein function prediction could therefore improve both the accuracy of the predictions and their interpretability. With these problems in mind, this dissertation investigates several key problems in constructing multilabel supervised topic models and designing their learning algorithms. The main research contents and achievements are as follows.

(1) Considering the large number of functional labels that characterize proteins and the high correlation between those labels, an exact Boolean matrix decomposition (BMD) algorithm based on label clusters was proposed. The algorithm performs hierarchical extended clustering of labels using the label-association matrix. Experimental results show that the algorithm completes exact BMD and offers a considerable advantage in reducing computational complexity. In addition, using the algorithm to reduce and then restore the dimensionality of the protein functional label space lays the foundation for more efficient multilabel classification.

(2) Multilabel supervised topic modeling was applied to protein function prediction, and a Label Distribution LDA model (LD-LDA) was proposed to improve the label-topic-word correspondence of existing models. Four learning algorithms were designed for it: collapsed Gibbs sampling (CGS), variational Bayesian (VB) inference, collapsed variational Bayesian (CVB) inference, and zero-order collapsed variational Bayesian (CVB0) inference. The LD-LDA model extends the generative process of the Labeled LDA (LLDA) and Partially Labeled LDA (PLDA) models by expressing each observed functional label as a probability distribution over the global hidden topic space and by introducing a background label to describe topics that are not closely associated with any functional label. Experimental results showed that, compared with the two existing models, LD-LDA describes the hidden substructure of functional labels in more detail and further improves the accuracy of protein function prediction.

(3) The Dirichlet multinomial regression (DMR) framework was introduced into multilabel topic modeling so that observed feature information about the proteins can assist functional label prediction. To this end, three improved multilabel supervised topic models were proposed: DMR-LLDA, DMR-PLDA, and DMR-LDLDA. Three learning algorithms were designed for each model. By placing an exponential prior, constructed from weighted features, on the hyperparameters of the protein-topic (or label) distribution, these models make use of the observed features of each protein. Results show that the three models improved the accuracy of protein function prediction because they use protein feature information in addition to the amino acid sequence.

(4) Two correlated label-supervised topic models, CLLDA and CLDLDA, were proposed on the basis of DMR-LLDA and DMR-LDLDA to exploit the hierarchical correlation between protein functional labels in function prediction. Three learning algorithms were designed for each model. These models use label-associated features that describe the hierarchical correlation among labels to optimize the hyperparameters of the global label-word (or topic) multinomial distribution. Experimental results show that this design strategy, which involves label-associated features, can further increase the accuracy of protein function prediction.

In summary, the dimensionality of the protein functional label space was first reduced using Boolean matrix decomposition, and multilabel supervised topic modeling and its learning algorithms were then improved in several respects. This dissertation provides an accurate and effective computational method for protein function prediction. Beyond protein function prediction, it also extends multilabel supervised topic modeling in ways that can be applied to other classification scenarios.
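
As an illustration of the feature-dependent prior described in item (3), the following is a minimal Python sketch. It assumes the standard DMR form, in which the Dirichlet hyperparameter for protein d and topic (or label) k is alpha[d][k] = exp(x_d · lambda_k). The abstract does not state the exact form used in the dissertation, and the function names, feature vector, and weights below are hypothetical.

```python
import math


def dmr_prior(features, weights):
    """Dirichlet hyperparameters for one protein under the assumed DMR form.

    features: observed feature values for the protein (hypothetical).
    weights:  one weight vector per topic/label, same length as features.
    Returns alpha[k] = exp(dot(features, weights[k])), always positive.
    """
    return [
        math.exp(sum(x * w for x, w in zip(features, topic_weights)))
        for topic_weights in weights
    ]


def prior_mean(alpha):
    """Mean of a Dirichlet(alpha) distribution: alpha[k] / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical protein: a constant bias feature plus two binary features.
protein_features = [1.0, 0.0, 1.0]

# Hypothetical weights: one row per topic (or label), one column per feature.
topic_weights = [
    [0.0, 0.0, 0.0],   # topic 0: no feature preference
    [1.0, 0.5, 0.0],   # topic 1: favored through the bias feature
    [-1.0, 2.0, 0.0],  # topic 2: disfavored unless feature 1 is present
]

alpha = dmr_prior(protein_features, topic_weights)
theta_mean = prior_mean(alpha)
print(alpha)
print(theta_mean)
```

Because the exponential keeps every hyperparameter positive, the weights can be fit without constraints, and proteins with different observed features receive different prior topic proportions.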
Keywords/Search Tags: Protein function prediction, Multilabel classification, Probabilistic topic model, Boolean matrix decomposition, Dirichlet multinomial regression