Research On Protein Function Prediction And Functional Module Discovery Incorporating Feature Selection

Posted on: 2021-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: H F Sun
Full Text: PDF
GTID: 2370330629480160
Subject: Computer Science and Technology
Abstract/Summary:
The completion of human genome sequencing has made proteomics one of the important fields of life science. Proteins participate in various life activities of the human body: the replication of genetic material, the control of gene expression, metabolism and other processes all depend on protein-protein interaction (PPI). Studying PPI networks therefore helps researchers understand these diverse biological processes systematically, and PPI networks have attracted growing attention in the post-genome era. Thanks to the continuing improvement of high-throughput technology, a large amount of protein interaction data has been collected. Although some of these proteins have been functionally annotated, the number of unannotated proteins keeps growing as more PPI network data is collected. How to annotate protein function reliably and efficiently has therefore become an important problem in biological research.

In the protein data collected so far, feature information is well defined for many proteins, while even more proteins appear only in the interaction network, with no additional feature information to help predict their function. For proteins of the latter kind, classification methods cannot be applied directly. Instead, a network embedding method can learn a low-dimensional representation of each protein in the PPI network, and this representation can serve as the protein's features for function prediction. For proteins that do have feature information, a clustering method can decompose the PPI network and the feature information jointly while removing noise from the data, so as to identify the functional modules in the PPI network.

Building on machine learning techniques including PPI network topology analysis, feature learning and sparse representation, network embedding, and non-negative matrix factorization, this thesis proposes two effective methods for annotating protein function:

(1) A multi-label learning method based on network embedding is proposed to annotate protein functions automatically. First, the original PPI network is weighted by edge betweenness to obtain a new weighted adjacency matrix. The ISOMAP algorithm then embeds this adjacency matrix into a low-dimensional space, yielding a low-dimensional feature vector for each protein node. These feature vectors are fed into a multi-label learning framework to form a multi-label linear regression model. A sparse penalty term is added to the model to select the most representative features of the protein nodes. Finally, because a protein may have multiple functions and those functions may be correlated, a regularization term on functional correlation is introduced into the multi-label learning model. Experiments on PPI networks confirm the effectiveness of the proposed method, which also achieves higher accuracy than several existing protein function prediction methods.

(2) A method based on non-negative matrix factorization is proposed to identify functional modules in PPI networks. Once the functional modules are obtained, the proteins in each module are annotated with functions. First, the adjacency matrix of the PPI network is decomposed to obtain the module membership matrix, while keeping the expected edges computed from it closely consistent with the original network topology. Then, the membership matrix and the module feature matrix are obtained by decomposing the feature matrix that describes the protein properties. Finally, an exclusive group lasso constraint is introduced to learn the most relevant features of each module. For the optimization, an efficient algorithm is designed that iteratively solves several subproblems with closed-form solutions. Comparative experiments on LFR synthetic networks and the DIP dataset show that the proposed method is more accurate than other module discovery methods.
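
To make the idea behind the second method more concrete, the following is a minimal sketch of module discovery by factorizing an adjacency matrix. It is not the thesis's algorithm: it uses a plain symmetric non-negative factorization A ≈ H Hᵀ with a standard damped multiplicative update, and it omits the protein feature matrix, the exclusive group lasso constraint, and the closed-form subproblem solver described above. The function names, the toy network, and the use of self-loops on the diagonal are all assumptions made for illustration.

```python
import numpy as np


def symmetric_nmf(A, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a symmetric non-negative matrix A by H @ H.T with H >= 0.

    Uses the damped multiplicative update
    H <- H * (1/2 + 1/2 * (A H) / (H H^T H)).
    Row i of H is read as the soft module membership of node i.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))  # non-negative random start
    for _ in range(n_iter):
        numer = A @ H
        denom = H @ (H.T @ H) + eps  # eps avoids division by zero
        H *= 0.5 + 0.5 * numer / denom
    return H


def toy_network():
    """Hypothetical 8-node network: two 4-node cliques joined by one bridge."""
    A = np.zeros((8, 8))
    A[:4, :4] = 1.0  # first module (diagonal kept at 1, i.e. self-loops)
    A[4:, 4:] = 1.0  # second module
    A[3, 4] = A[4, 3] = 1.0  # single edge between the modules
    return A


A = toy_network()
H = symmetric_nmf(A, k=2)

# Hard assignment: each node goes to the module with the largest membership.
modules = H.argmax(axis=1)
print(modules)
```

Here the product H Hᵀ plays the role of the expected edges, so minimizing its distance to A keeps the learned memberships consistent with the observed topology. Which of the two column indices each clique receives depends on the random initialization.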
Keywords/Search Tags: Protein-Protein Interaction Networks, Function Annotation, Network Embedding, Multi-label Learning, Feature Selection, Non-negative Matrix Factorization, Functional Module Discovery