Font Size: a A A

Classification Models For Predicting RNA Binding Proteins

Posted on:2022-11-01Degree:MasterType:Thesis
Country:ChinaCandidate:J RaoFull Text:PDF
GTID:2480306773967939Subject:Automation Technology
Abstract/Summary:PDF Full Text Request
Nucleic acid binding proteins(NABPs)are a class of proteins that play a role by binding with nucleic acids.They are divided into deoxyribonucleotide(DNA)binding proteins and ribonucleotide(RNA)binding proteins.NABPs play an important role in many life activities,such as gene transcription,translation,gene regulation and so on.With the continuous in-depth research of protein engineering the deep integration of interdisciplinary development,the prediction of NABPs,especially RNA-binding proteins(RBPs),has become one of the hotspots in the field of bioinformatics.At present,the research on RBP is mainly divided into traditional biological experimental methods and computational methods.The traditional biological wet experimental methods often have the limitations of a long experimental cycle,high human and material cost,experimental difficulty and uncertain experimental environments.Therefore,they cannot deal with the explosive growth of protein sequence data.With the development of multi-disciplinary integration,especially artificial intelligence,machine learning technology has been applied to the field of RNA binding protein prediction,and achieved good results,but there are still some limitations and deficiencies:(1)The class distribution of real RNA binding protein data is highly unbalanced,which will make the classifier tend to most classes,resulting in poor performance in identifying RNA binding proteins;(2)The dimension of feature vectors used to describe proteins is often high,which may lead to dimension disaster when the number of samples is insufficient,and the classification performance of machine learning methods may be reduced to a certain extent;(3)The common physicochemical property and evolutionary conservation-based features are often complex and artificially designed,which is not conducive to the representation learning ability of deep neural networks.Moreover,the acquisition of evolutionary conservation generally needs special tools to make multi-sequence alignments on a large-scale protein database,which has high computer resources and long computing cycles and limits the broad application of models.Therefore,this paper will further improve the RBP prediction model by using the methods of machine learning and deep neural network.In this paper,a new RBP classification and prediction model are proposed to solve these problems.The main contents include:(1)A feature selection-based support vector machine method based on XGBoost.The model is mainly aimed at the classification and prediction of small-scale data,which can effectively improve the prediction accuracy of the model without using biological information features.XGBoost is used to select the optimal subset from the statistical characteristics of tripeptide patterns of high-dimensional protein sequences,and the SMOTE algorithm is used for data balancing.The experimental results show that the method is effective;the results on the standard datasets are significantly better than those of other models,and more than 2% improvement is obtained in the two global performance indexes of MCC and AUC.(2)RBP was predicted based on the method of short amino acid frequency of deep learning.The model is mainly aimed at the classification and prediction of large-scale data.The protein sequence data is regarded as similar text data,the dipeptide and tripeptide frequencies are used as similar text word frequencies,and the fusion method of the XGBoost algorithm and convolutional neural network is used to predict multi-species RNA-binding proteins.The results show that in the 10 x cross validation,the AUC of the model on Human,E.coli,and Salmonella datasets is 0.94,0.97,and 0.94,respectively,and the MCC on independent test sets is 0.66,0.68,and 0.73,respectively.The AUC values reached 0.91,0.96,and 0.91,respectively.In addition,the model shows higher stability and better generalization performance in crossspecies testing.This provides a new idea for the study of cross-species RBP.
Keywords/Search Tags:Protein sequences, Machine learning, Convolutional neural network, Feature selection, RNA binding proteins
PDF Full Text Request
Related items