Research on Methods for Identifying Nucleic Acid Binding Proteins and Their Binding Residues

Posted on: 2023-06-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhang
GTID: 1520307376485114
Subject: Computer application technology
Abstract/Summary:
With the development of sequencing technology, more and more protein sequences are being discovered. However, the functions of most of these proteins are unknown. Identifying the functions of proteins and their functional residues is the key to deciphering the book of life. Nucleic acid binding proteins (NABPs) play crucial roles in gene expression, such as posttranscriptional gene regulation, RNA stability, degradation, splicing, and polyadenylation. However, there is a huge gap between the number of known NABPs and the number yet to be discovered, which hinders the understanding of the interaction mechanisms between proteins and nucleic acids. Identifying NABPs and their binding residues (NABRs) is a research hotspot in the field of protein function analysis. Although some biological experiments can detect NABPs and their binding residues, these experiments are technically challenging, costly, and time-consuming, which makes large-scale detection difficult. Developing fast and efficient computational methods to provide molecular screening for biological experiments has become increasingly urgent. Several computational methods have been proposed to identify NABPs and their binding residues. However, these existing methods have some problems, such as cross-prediction and ignoring long-distance dependencies. In this dissertation, several computational methods based on deep learning techniques are proposed to solve the main problems in this field. The main contents include the following four aspects:

(1) To measure and learn local conservation patterns in proteins and address the cross-prediction problem, the method iDRBP_MMC is proposed to identify nucleic acid binding proteins based on multi-label learning and a motif-based convolutional neural network. In this method, the multi-label learning framework is introduced to consider both DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) to solve the cross-prediction
problem. Considering that protein functions are related to the conserved domains in proteins, this study characterizes protein sequences with evolutionary information profiles and employs a convolutional neural network to capture locally conserved patterns. In addition, the motif-based convolutional neural network is designed to integrate prior knowledge from protein structural motifs into the prediction model. The experimental results show that the multi-label learning framework can reduce the cross-prediction between DBPs and RBPs, and that integrating structural motifs through the motif-based convolutional neural network further improves NABP identification performance.

(2) To handle the commonalities and differences among NABPs from different species, a species-specific NABP identification method is proposed based on a feature induction and transfer framework. This method builds the prediction model on the motif-based convolutional network proposed in the previous study to fuse protein sequence information and structural motif information. First, the model is initialized and trained on multi-species protein data to generalize the common characteristics of NABPs from different species. Then, following a transfer learning strategy, the initialized feature induction model is transferred to a species-specific NABP recognition task and fine-tuned on protein data from that species to learn species-specific protein signatures. The experimental results show that the feature induction and transfer framework can effectively exploit the common and distinct characteristics of NABPs from different species and improve the prediction performance of species-specific NABP identification. This framework can also address the problem of insufficient training samples for some species.

(3) To address the problem that long-distance dependencies among amino acids and sequence order information are ignored in existing computational
methods, the method NCBRPred is proposed to identify nucleic acid binding residues (NABRs) in NABPs based on a multi-label sequence labeling model. Considering that NABRs are usually located on the surface of proteins and are associated with the structural environment and conserved domains, this method employs two evolutionary information profiles to characterize protein sequences and constructs residue feature vectors by fusing in the predicted protein secondary structure and solvent accessibility. Then, a sliding window strategy is used to enhance the representation of amino acid residues. In this method, a sequence labeling framework replaces the classification framework based on local protein fragments; it employs a bidirectional gated recurrent neural network (BGRU) to process the complete protein globally, using sequence order information and capturing both short- and long-distance dependencies among amino acids. The multi-label learning strategy is also used to reduce the cross-prediction between the binding residues of DBPs and RBPs. The experimental results show that the multi-label sequence labeling framework outperforms the classification framework based on local protein fragments in identifying the binding residues of NABPs.

(4) To address the insufficient fusion of protein sequence and structure information and the poor generalization ability of existing methods, the method iNucRes-ASSH is proposed to identify NABRs based on a self-attention-based structure-sequence hybrid neural network. This method also uses two evolutionary information profiles to characterize protein sequences and integrates protein structure information from four aspects to construct residue feature vectors: spatial geometric features, amino acid secondary structure, solvent accessibility, and atomic features. To reduce the sensitivity of the model to protein structure changes and enhance its generalization ability, the method uses a self-attention mechanism to learn the
local structural patterns of residues from the structural context and uses a BGRU network to further fuse the local structural patterns and sequence order information. The binding residues of NABPs are detected by a multi-label discriminant module. The experimental results show that a comprehensive fusion of protein sequence and structure information can improve the prediction performance, and that iNucRes-ASSH achieves the best performance in test scenarios with both known and predicted structures, showing stronger generalization ability.
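
As an illustration of the multi-label framing used for DBPs and RBPs, the sketch below encodes each protein as a pair of binary labels and measures cross-prediction. It assumes that cross-prediction means a protein that binds only one type of nucleic acid being predicted to bind the other; the abstract does not give a formal definition. The function name `cross_prediction_rates` and the toy data are hypothetical and are not taken from the dissertation.

```python
from typing import List, Tuple

# Each protein carries two binary labels: (binds DNA, binds RNA).
# This encoding is an assumption made for illustration.
Label = Tuple[int, int]


def cross_prediction_rates(y_true: List[Label], y_pred: List[Label]) -> Tuple[float, float]:
    """Return (RBP-only predicted as DNA-binding, DBP-only predicted as RNA-binding)."""
    # Predictions for proteins whose true label is RNA-binding only.
    rbp_only = [p for t, p in zip(y_true, y_pred) if t == (0, 1)]
    # Predictions for proteins whose true label is DNA-binding only.
    dbp_only = [p for t, p in zip(y_true, y_pred) if t == (1, 0)]
    rbp_as_dbp = sum(p[0] for p in rbp_only) / len(rbp_only) if rbp_only else 0.0
    dbp_as_rbp = sum(p[1] for p in dbp_only) / len(dbp_only) if dbp_only else 0.0
    return rbp_as_dbp, dbp_as_rbp


# Toy data (made up): two DBPs, two RBPs, one protein binding both, one binding neither.
y_true = [(1, 0), (1, 0), (0, 1), (0, 1), (1, 1), (0, 0)]
y_pred = [(1, 0), (1, 1), (1, 1), (0, 1), (1, 1), (0, 0)]
rbp_as_dbp, dbp_as_rbp = cross_prediction_rates(y_true, y_pred)
print(rbp_as_dbp, dbp_as_rbp)
```

Treating the two labels jointly lets a single model see DBPs, RBPs, and proteins that bind both during training, which is the setting in which the abstract reports reduced cross-prediction.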
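
The sliding window strategy mentioned for NCBRPred can be sketched as follows. This is a minimal example that assumes each residue's representation is the concatenation of its own feature vector with those of its neighbours, with zero padding at the sequence ends. The window width, the padding scheme, and the function name `sliding_window_features` are assumptions and are not stated in the abstract.

```python
from typing import List


def sliding_window_features(features: List[List[float]], half_width: int = 2) -> List[List[float]]:
    """Concatenate each residue's feature vector with those of its neighbours.

    Positions beyond either end of the sequence are zero-padded (an assumption).
    """
    if not features:
        return []
    dim = len(features[0])
    pad = [0.0] * dim
    windowed = []
    for i in range(len(features)):
        row: List[float] = []
        # Walk the window centred on residue i, left to right.
        for j in range(i - half_width, i + half_width + 1):
            row.extend(features[j] if 0 <= j < len(features) else pad)
        windowed.append(row)
    return windowed


# Toy protein with 4 residues and 2 features per residue (made-up values).
toy = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
windowed = sliding_window_features(toy, half_width=1)
print(len(windowed), len(windowed[0]))
```

Each output row has `(2 * half_width + 1) * dim` values, so a downstream sequence model such as a bidirectional GRU receives local context at every position in addition to the dependencies it learns across the full sequence.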
Keywords/Search Tags: protein function, nucleic acid binding proteins, binding sites, multi-label learning, deep learning