Research On Protein Function Prediction And Functional Module Discovery Incorporating Feature Selection

Posted on:2021-05-26

Degree:Master

Type:Thesis

Country:China

Candidate:H F Sun

GTID:2370330629480160

Subject:Computer Science and Technology

Abstract/Summary:

The completion of human genome sequencing has made proteomics one of the important fields of life science. Proteins take part in many life activities of the human body: replication of genetic material, control of gene expression, metabolism and other processes all depend on protein-protein interaction (PPI). Studying the PPI network therefore helps researchers understand these diverse biological processes systematically, and PPI networks have attracted growing attention in the post-genome era. High-throughput technology continues to improve, and thanks to its development a large amount of protein interaction data has been collected. However, although the functions of some of these proteins have been labeled, the number of unlabeled proteins keeps growing as more PPI network data is collected. How to label protein functions reliably and efficiently has therefore become an important problem in biological research.

In the protein data collected so far, feature information is clearly defined for a large share of the proteins, while many more proteins appear only in the interaction network and have no additional feature information to help predict their function. For proteins that exist only in the interaction network, a classification method cannot be applied directly; instead, a network embedding method can learn a low-dimensional representation of each protein in the PPI network, and this representation can serve as the protein's features for function prediction. For proteins that do have feature information, a clustering method can decompose the PPI network and the feature information together while removing noise from the data, so as to identify the functional modules in the PPI network. Therefore, drawing on machine learning theory including PPI network topology analysis, feature learning and sparse representation, network embedding, and non-negative matrix factorization, this thesis proposes two effective methods for labeling protein functions:

(1) A multi-label learning method based on network embedding is proposed to label protein functions automatically. First, we weight the original PPI network with edge betweenness to obtain a new weighted adjacency matrix. Next, the ISOMAP algorithm embeds the new adjacency matrix into a low-dimensional space to obtain a low-dimensional feature vector for each protein node. These low-dimensional feature representations are then placed in a multi-label learning framework to form a multi-label linear regression model. A sparse penalty term is introduced into the model to select the most representative features of the protein nodes. Finally, because a protein may have multiple functions and those functions may be correlated, we add a functional-correlation regularization term to the multi-label learning model. Experiments on a PPI network confirm the effectiveness of the proposed method, which also achieves higher accuracy than several existing protein function prediction methods.

(2) A method based on non-negative matrix factorization is proposed for identifying functional modules in a PPI network. Once the functional modules are obtained, the functions of the proteins in each module are labeled. First, we decompose the adjacency matrix of the PPI network to obtain the module membership matrix, while keeping the expected edges computed from it closely consistent with the original network topology. Then, the membership matrix and the module feature matrix are obtained by decomposing the feature matrix that describes the protein properties. Finally, an exclusive group lasso constraint is introduced to learn the most relevant features of each module. For the optimization, we design an efficient algorithm that iteratively solves several subproblems with closed-form solutions. Comparative experiments on LFR synthetic networks and DIP datasets show that the proposed method is more accurate than other module discovery methods.
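The first step of method (1), weighting the PPI network by edge betweenness, can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes an unweighted, undirected network without self-loops, uses the raw (unnormalized) betweenness as the edge weight, and the function names `edge_betweenness` and `weighted_adjacency` as well as the protein labels `P1` to `P4` are hypothetical. The abstract does not state how the betweenness values are combined with the original adjacency matrix, so that choice is an assumption.

```python
from collections import deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours (no self-loops assumed).
    Returns a dict keyed by frozenset({u, v}); each value sums, over all
    unordered node pairs, the fraction of shortest paths using that edge.
    """
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and record shortest-path predecessors of every node.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every unordered pair was visited from both endpoints, so halve.
    return {edge: b / 2.0 for edge, b in score.items()}


def weighted_adjacency(adj):
    """Build a symmetric matrix whose entries are edge-betweenness weights.

    Assumption: the weight of an edge is its raw betweenness value.
    """
    nodes = sorted(adj)
    index = {n: i for i, n in enumerate(nodes)}
    W = [[0.0] * len(nodes) for _ in nodes]
    for edge, b in edge_betweenness(adj).items():
        u, v = tuple(edge)
        W[index[u]][index[v]] = b
        W[index[v]][index[u]] = b
    return nodes, W


# Toy network with hypothetical protein names: a chain P1 - P2 - P3 - P4.
ppi = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2", "P4"},
    "P4": {"P3"},
}
nodes, W = weighted_adjacency(ppi)
```

In the chain above, the middle edge P2-P3 lies on more shortest paths than the two outer edges, so it receives the largest weight. A matrix like `W` would then be passed to the embedding step; converting these weights into the distances that ISOMAP expects is a further design choice the abstract does not specify.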

Keywords/Search Tags:

Protein-Protein Interaction Networks, Function Annotation, Network Embedding, Multi-Label Learning, Feature Selection, Non-negative Matrix Factorization, Functional Module Discovery

Related items

1	Algorithm Design And Implementation Predict Protein Function Based On The Random Walk
2	Prediction Of Protein Subcellular Localization Based On Two Regular Term Nonnegative Matrix Decomposition
3	Study On Protein Function Prediction Based On Random Walk
4	Protein Function Prediction And Refinement Based On Manifold Learning
5	Identifying Protein Complexes And Functional Modules In Protein Interaction Networks
6	Protein-Protein Interaction Network Analysis Based On Spectral Method
7	Research Of Protein Complex Extraction Based On Protein-Protein Interaction Network
8	Research On Identification And Application Of Protein Complexes In Protein-Protein Interaction Networks
9	Research On User Relationship Discovery Of Signed Network Based On Nonnegative Matrix Factorization
10	Research On Function Annotation Of Biological Macromolecules Based On Machine Learning