| Biomacromolecules play an important role in life processes. DNA is the main genetic material, and proteins are the main carriers of life activities. The study of biological macromolecules has long been one of the main research directions of modern biology. Binding between biological macromolecules, such as between transcription factors and transcription factor binding sites or between MHC I molecules and peptides, has an important effect on gene regulation and expression, human adaptive immunity, and other functions. Models that infer biomacromolecule binding can serve as adjunct tools for clinical immunotherapy. Traditional biological experiments are time-consuming and costly, and cannot be applied to large-scale binding research and analysis of macromolecules. Given the large volume of new biomacromolecule sequencing data, efficient computational methods are urgently needed to identify biomacromolecule binding. Using deep learning, this paper studies two problems: transcription factor binding site recognition and MHC I-peptide binding prediction. The specific research contents are as follows: (1) In transcription factor binding site recognition, traditional biological experiments cannot be applied rapidly and at scale to massive DNA-protein binding data. We propose an attention-based hybrid deep learning model, DeepARC, to rapidly identify transcription factor binding sites. For feature embedding, DeepARC adopts a position-based feature encoding method that combines the advantages of one-hot encoding and word2vec distributed encoding. For model architecture, DeepARC adopts a hybrid of a convolutional neural network, a recurrent neural network, an attention mechanism, and a deep fully connected neural network. Furthermore, DeepARC can generate DNA attention weight heat maps through the attention mechanism, which introduces 
interpretability for transcription factor binding site identification studies. Test results on data from the ENCODE project show that DeepARC reaches a ROC-AUC of 0.908 and outperforms DeepTF, CNN-Zeng, and DeepBind. (2) In MHC I-peptide binding prediction, existing methods use deep learning techniques to extract hidden features of MHC I molecules and peptides, but ignore the effect of sequence similarity on MHC I-peptide binding. In this paper, an MHC I-peptide heterogeneous network is constructed based on entity sequence similarity, and HNMHC, an MHC I-peptide binding prediction method based on this heterogeneous network, is studied. First, HNMHC uses SimHash and Clustal Omega to calculate the peptide-peptide and MHC I-MHC I sequence similarity matrices, respectively, and combines them with the MHC I-peptide binding data provided by the IEDB database to construct the MHC I-peptide heterogeneous network. Then, HNMHC uses the heterogeneous network embedding method BHIN2vec to embed the MHC I and peptide entities in the heterogeneous information network. Finally, predictions are made using only simple fully connected neural networks. Test results on the IEDB dataset show that HNMHC reaches a ROC-AUC of 0.951 and a PR-AUC of 0.982. Compared with the deep learning-based methods MHCAttnNet and MHCflurry, HNMHC achieves better prediction performance and faster training. (3) To lower the difficulty of using deep learning models in biological research, a web platform for transcription factor binding site identification and MHC I-peptide binding prediction was constructed. The platform provides a friendly interface and instructions for biological researchers to run model predictions. Users can access the models' prediction, data download, and sequence visualization functions through a graphical web 
interface, lowering the barrier to using deep learning models for biology researchers. |
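
The abstract states that HNMHC derives a peptide-peptide similarity matrix with SimHash before building the heterogeneous network. The sketch below illustrates how such a matrix could be computed. It is an assumption-laden illustration and not the authors' implementation: the 3-mer tokenization, the MD5 token hash, the 64-bit fingerprint width, and the function names `kmers`, `simhash`, and `similarity_matrix` are all hypothetical choices made here for clarity.

```python
import hashlib


def kmers(seq, k=3):
    """Return overlapping k-mers; a sequence shorter than k yields itself."""
    if len(seq) < k:
        return [seq]
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def simhash(seq, k=3, bits=64):
    """Compute a SimHash fingerprint from the k-mers of a sequence.

    Assumption: tokens are hashed with MD5 and truncated to `bits` bits
    (bits must be a multiple of 8 and at most 128).
    """
    counts = [0] * bits
    for token in kmers(seq, k):
        digest = hashlib.md5(token.encode("ascii")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def similarity(a, b, k=3, bits=64):
    """Similarity in [0, 1]: one minus the normalized Hamming distance."""
    hamming = bin(simhash(a, k, bits) ^ simhash(b, k, bits)).count("1")
    return 1.0 - hamming / bits


def similarity_matrix(seqs, k=3, bits=64):
    """Pairwise SimHash similarity matrix for a list of sequences."""
    prints = [simhash(s, k, bits) for s in seqs]
    return [[1.0 - bin(x ^ y).count("1") / bits for y in prints]
            for x in prints]


if __name__ == "__main__":
    peptides = ["SIINFEKL", "SIINFEKM", "GILGFVFTL"]  # example peptides
    for row in similarity_matrix(peptides):
        print(["%.3f" % v for v in row])
```

The resulting matrix is symmetric with ones on the diagonal, so its entries could serve as weights for peptide-peptide edges in a heterogeneous network. The tokenization and any similarity threshold for keeping an edge would need to follow the settings of the original work, which the abstract does not specify.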