| Proteins are common macromolecules in living organisms and play an important role in gene regulation, transcription, and translation. The binding of proteins to deoxyribonucleic acid (DNA) plays an important role in regulating gene expression, and proteins of this class are called DNA-binding proteins (DBPs). Accurate identification of DBPs helps to understand the pathogenesis of cancer, genetic diseases, and other intractable diseases, and promotes the research and development of related drugs. Traditional biological experimental methods are costly and slow, so they cannot meet the needs of high-throughput data. With the continuous accumulation of protein sequence data and the development of machine learning, machine-learning-based DBP prediction has received increasing attention. Many models have achieved some success in predicting DBPs, but their prediction performance still falls short of the requirements of practical applications. One key to improving the performance of DBP prediction models is extracting effective features from protein sequences. This paper therefore studies effective feature representations for predicting DBPs, in particular the extraction of complex protein sequence features with deep learning methods. The main research work covers the following three aspects: 1. A multi-view integrated model for DNA-binding protein recognition is proposed. Because a single feature cannot comprehensively capture the required protein sequence information, protein feature scoring algorithms, including the position-specific scoring matrix and the predicted solvent accessibility probability matrix, are first used to extract the original feature space of the sequence information. Second, the local information and edge information of the protein are further extracted from the original feature space. Finally, the feature vectors are fed into a convolutional neural 
network and a support vector machine to extract the deep and shallow relationships among the different features. At the classification level, a fully connected layer based on deep features and a support vector machine based on shallow features are combined by weighting to obtain the final classifier. Experimental results on benchmark datasets show that the convolutional neural network performs significantly better than existing traditional classification models, and that integrating it with the support vector machine further improves classification accuracy. 2. A multi-scale convolutional neural network model based on an attention mechanism is proposed to identify DNA-binding proteins. Building on the previous method, this model introduces a convolutional block attention module that assigns different weights to the features extracted by each convolutional kernel and optimizes these weights automatically during learning. The performance of different embedding positions and calculation methods of the attention module is compared on the benchmark dataset to establish the final classification model. A performance comparison on the independent test set shows that introducing the attention module improves the performance of the original convolutional neural network model. 3. An attentional convolutional neural network model based on feature embedding is proposed to identify DNA-binding proteins. The model automatically generates embedded feature vectors of protein sequences with a protein language model, and uses a convolutional neural network with an attention module for feature extraction and classification. This paper investigates two different protein feature embedding methods and attention modules, and trains two different models, before-flatten-SCA and ECA-LeNet. The before-flatten-SCA model implements a multi-scale self-attention mechanism, based on the idea of a packet-filling convolutional neural network, to process the embedded features, and uses Spatial and Channel-wise Attention 
to mark key information and mine deeper semantic information; ECA-LeNet adds Efficient Channel Attention to the LeNet convolutional neural network. The experimental results demonstrate the effectiveness of the features generated by the protein language model and the performance improvement brought by the attention module. In summary, the first part of this work focuses on feature extraction and processing, extracting different feature information from multiple perspectives to represent DNA-binding proteins. The second part examines whether combining an attention mechanism with a convolutional neural network is effective for predicting DNA-binding proteins, paving the way for the third part. The third part builds on this work by introducing a protein language model to embed protein sequences as features, which avoids manual feature design, and uses an attention-based convolutional network to extract discriminative features for recognizing DNA-binding proteins. |
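
The weighted integration of the deep and shallow classifiers described in the first contribution can be illustrated with a minimal Python sketch. It assumes that each model outputs a per-sequence probability of being a DBP and that the two are merged as a convex combination; the function names, the weight `w_cnn`, the 0.5 threshold, and the example scores are all hypothetical and are not taken from the work itself.

```python
from typing import List, Sequence


def fuse_probabilities(p_cnn: Sequence[float],
                       p_svm: Sequence[float],
                       w_cnn: float = 0.6) -> List[float]:
    """Convex combination of two models' DBP probabilities.

    Assumed fusion rule: w_cnn * p_cnn + (1 - w_cnn) * p_svm.
    The default weight is an arbitrary illustrative value.
    """
    if len(p_cnn) != len(p_svm):
        raise ValueError("both models must score the same sequences")
    if not 0.0 <= w_cnn <= 1.0:
        raise ValueError("w_cnn must lie in [0, 1]")
    return [w_cnn * a + (1.0 - w_cnn) * b for a, b in zip(p_cnn, p_svm)]


def predict_labels(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a sequence as DBP (1) when its fused probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical scores for three protein sequences.
p_cnn = [0.9, 0.2, 0.5]  # deep branch (CNN with fully connected layer)
p_svm = [0.7, 0.4, 0.3]  # shallow branch (support vector machine)

fused = fuse_probabilities(p_cnn, p_svm, w_cnn=0.5)
labels = predict_labels(fused)
print(labels)
```

The abstract does not say how the weights are chosen; one plausible option, again an assumption, is to select `w_cnn` on held-out validation data.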