| RNA-binding proteins (RBPs) are proteins in eukaryotic cells that bind RNA. They are powerful and widespread regulatory factors with significant biological functions: they regulate splicing, RNA transport, and other post-transcriptional processes, and they interact with RNA through specific RNA-binding domains. Rapid computational analysis and prediction are therefore essential for a comprehensive understanding of RBPs. This paper uses evolutionary information, original sequence information, structural information, and dipeptide and tripeptide distribution information as feature representations. Several deep learning methods are used to build a sub-classification model for each feature, and feature fusion and Stacking ensemble learning are then used to fuse the output features of the sub-models. The work covers the following aspects. First, a training set containing more than 10,000 RBP sequences and three independent test sets (human, S. cerevisiae, and A. thaliana) were compiled; to broaden the range of independent test sets, a mouse test set was also constructed. The data suffer from an unbalanced sequence length distribution and unbalanced numbers of positive and negative samples. To address this, a sliding window is used to cut sub-sequences, and the window length and sliding step are optimized; this evens out the sequence length distribution, alleviates the imbalance between positive and negative samples, and at the same time enlarges the sample size. Second, because previous methods extract the genetic and mutation information in RBP sequences insufficiently, this paper proposes four steps to obtain richer and more effective features: (1) encode the evolutionary information of RBP sequences with a position-specific scoring matrix (PSSM); (2) design a deep learning model for the evolutionary information that comprises an embedding layer, an attention mechanism, an LSTM, and a convolutional layer, so as to capture the specificity and similarity of amino acids as far as possible while retaining the evolutionary information of the sequence; (3) add the secondary structure information of the RBP sequence to supply structural information; (4) add the original sequence information after amino acid embedding. The structural and original sequence information complement the evolutionary information, allowing the model to learn more discriminative and more informative features. Third, to address insufficient peptide distribution information and limited feature diversity, a peptide distribution matrix containing dipeptide and tripeptide information is added; because the tripeptide distribution matrix is too sparse, an improved max pooling method is used to reduce the feature dimension along a single direction. Finally, the Stacking method is used to integrate the feature information of the sub-models, and reliable evaluation metrics are introduced to compare and analyze the classification performance of different parameters and methods. The results show that our model outperforms the comparison method on the four validation sets and learns RBP features more effectively. Related resources are available at https://github.com/mmwangxu/Deep Fusion-RBP-tool. |
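
The sliding-window step mentioned in the abstract can be illustrated with a short sketch. It is not the authors' implementation: the function name `sliding_windows`, the treatment of sequences shorter than the window, and the extra end-aligned window are assumptions made here for illustration, and the window and step values in the usage note are arbitrary rather than the optimized values from the paper.

```python
from typing import List


def sliding_windows(seq: str, window: int, step: int) -> List[str]:
    """Cut a protein sequence into fixed-length, possibly overlapping fragments.

    Assumptions (not taken from the paper):
    - a sequence no longer than `window` is returned as a single fragment;
    - if the last regular window does not reach the end of the sequence,
      one extra window aligned to the sequence end is appended so that
      the C-terminal residues are not discarded.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive integers")
    if len(seq) <= window:
        return [seq]

    # Start positions of the regular windows.
    starts = list(range(0, len(seq) - window + 1, step))

    # Add an end-aligned window when the regular grid misses the tail.
    last_start = len(seq) - window
    if starts[-1] != last_start:
        starts.append(last_start)

    return [seq[s:s + window] for s in starts]
```

For example, `sliding_windows("ABCDEFGHIJ", 4, 3)` yields three fragments of length 4. A long sequence contributes many fragments and a short one contributes a single fragment, which is how windowing both evens out fragment length and increases the number of training samples.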