| Proteins are essential components of living systems and participate in most life activities. In most cases, proteins do not act alone but bind to other proteins through chemical bonds to carry out their functions in cells. Accurate identification of protein-protein interaction (PPI) sites is therefore of great significance to medicine, pharmacy, and genetics. Current methods for identifying PPI sites fall into two main groups: sequence-based and structure-based. Sequence-based methods use only protein sequences and, lacking structural information, still perform far worse than structure-based methods. The current trend in structure-based methods is to model structural data with graph neural networks, but the limitations of the graph network itself leave room for improvement. To address these problems, this thesis proposes a new sequence-based PPI site prediction method. The main work is as follows: (1) This thesis introduces transfer learning from two perspectives, feature input and model training, and implements a two-stage transfer learning strategy. For features, ESM-2, the largest pre-trained protein language model, is used to encode the sequence. Owing to its huge training data and parameter count, the ESM-2 encoding carries more latent information, which compensates to some extent for the limited information available to sequence-based methods. For model training, a protein-peptide binding residue dataset is used to pre-train the model so that it starts from good initial parameters; these parameters are then transferred to the PPI site prediction task and fine-tuned to improve model performance. (2) For the network framework, this thesis adopts a dynamic graph convolutional neural network as the basis of the model. Traditional graph neural networks are now widely used in the major protein prediction tasks. Although they perform well, their adjacency relationships are fixed before input (usually set by calculating the distances between residues), so the network cannot fully mine node interactions in the high-dimensional semantic space. In a dynamic graph convolutional neural network, by contrast, the adjacency relationship is not fixed. Instead, the k-nearest neighbor algorithm is applied at different layers of the network to recompute the "feature distance" between nodes and construct a new graph structure. This dynamic-graph feature extraction lets the network adapt flexibly to different feature spaces and mine the connections between nodes more deeply. (3) To verify the performance of the model, experiments were conducted on the two main datasets used for PPI site prediction. The effectiveness of the method is examined comprehensively with seven evaluation metrics, from the perspectives of overall performance, module effectiveness, factors influencing model performance, and result visualization. The experimental results show that the method achieves state-of-the-art performance on both public datasets. In particular, on the main dataset, Dataset 1, it exceeds the current advanced method EGRET by 5.9% in F1, 4.9% in AUROC, 10.1% in AUPRC, and 13.3% in MCC. |
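
The dynamic graph idea in item (2), rebuilding the neighbourhood graph from the current node features at each layer with k-nearest neighbours, can be illustrated with a minimal pure-Python sketch. This is not the thesis model: the function names (`knn_graph`, `edge_conv`, `dynamic_forward`) are hypothetical, Euclidean distance is assumed as the "feature distance", and the learned layer is replaced by a fixed, parameter-free aggregation (channel-wise maximum of neighbour offsets) purely for illustration.

```python
import math

def knn_graph(features, k):
    """For each node, return the indices of its k nearest neighbours
    measured in the current feature space (Euclidean distance assumed)."""
    n = len(features)
    neighbours = []
    for i in range(n):
        dists = [(math.dist(features[i], features[j]), j)
                 for j in range(n) if j != i]
        dists.sort()  # ties are broken by node index
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

def edge_conv(features, neighbours):
    """Toy stand-in for a learned layer: concatenate each node's features
    with the channel-wise max of the offsets (x_j - x_i) to its neighbours."""
    out = []
    for x_i, nbrs in zip(features, neighbours):
        offsets = [[b - a for a, b in zip(x_i, features[j])] for j in nbrs]
        agg = [max(col) for col in zip(*offsets)]
        out.append(list(x_i) + agg)
    return out

def dynamic_forward(features, k, num_layers):
    """Recompute the k-NN graph from the current features before every layer,
    instead of fixing the adjacency once before input."""
    for _ in range(num_layers):
        neighbours = knn_graph(features, k)
        features = edge_conv(features, neighbours)
    return features

# Four residues with one-dimensional toy features.
residues = [[0.0], [1.0], [10.0], [11.0]]
graph = knn_graph(residues, k=1)
layer1 = edge_conv(residues, graph)
```

In a static graph network, `knn_graph` would be called once on residue coordinates and reused; here it is called inside the loop, so the adjacency follows whatever feature space each layer produces.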