| In cellular activities, essential proteins play a crucial role and are indispensable for the survival and reproduction of organisms. Therefore, predicting essential proteins is vital for understanding basic biological requirements, identifying pathogenic genes, and finding drug targets. With the rapid accumulation of high-throughput data, many methods for predicting essential proteins have emerged. However, these methods still face the following challenges: (1) Network centrality-based methods perform poorly in complex protein-protein interaction (PPI) networks, while machine learning-based methods make little use of the temporal and spatial characteristics of biological information. (2) Deep learning-based methods have not sufficiently explored network dynamics. (3) Deep learning-based prediction methods have rarely been studied in combination with multilayer networks and semi-supervised learning. To address these issues, this thesis explores how to effectively process and combine different types of biological information to identify essential proteins: (1) To address the limitations of centrality algorithms in complex networks and the insufficient use of biological information in machine learning algorithms, this thesis proposes the MBI-EP model, which identifies essential proteins based on multi-source biological information fusion. First, the model uses node2vec to learn continuous feature representations of proteins from the PPI network, comprehensively capturing the diversity of connection patterns in the network. At the same time, depthwise separable convolution is applied to gene expression data to capture the trend of gene expression changes over time in different experimental environments. In addition, the MBI-EP model uses a new subcellular localization processing method: subcellular localizations are ranked in descending order of protein coverage, and the top n subcellular localization partitions are then selected and mapped to node features. Finally, the
model integrates the three types of features extracted from biological information for classification. Experimental results show that the MBI-EP model performs strongly on the BioGRID yeast dataset, with an accuracy of 90.48% and a precision of 73.06%, which is significantly better than traditional centrality methods and shallow machine learning methods. (2) Comparing the contributions of the different types of biological information in the MBI-EP model to the prediction results shows that gene expression data contributes the least. To address the low utility of gene expression data, as well as the insufficient application of deep learning methods to dynamic PPI networks, this thesis presents a novel approach called ECD-EP, which identifies essential proteins using evolutionary community detection. First, the model combines time-course gene expression data with static PPI networks and filters out abnormal edges at each timestamp according to the three-sigma rule, constructing a dynamic network that changes over time. Then, using the birth and death information of edges, the model constructs an interaction stream source for the dynamic network and extracts overlapping communities using an evolutionary community detection algorithm. Next, the top m largest communities are selected and mapped to node features. Finally, the model integrates the features extracted from subcellular localization data with the community features for classification and prediction. Experimental results show that ECD-EP outperforms the MBI-EP model and all compared algorithms on multiple datasets, especially on the DIP dataset, where the accuracy, recall, and F1 score of ECD-EP improve by 2.55%, 6.89%, and 4.89%, respectively, compared with the MBI-EP model. (3) An analysis of the features employed in the MBI-EP and ECD-EP models shows that subcellular localization data contributes the most, whereas
gene expression data contributes the least. To tackle the challenge of effectively integrating biological information and the inadequate application of multilayer networks and semi-supervised learning to the identification of essential proteins, this thesis introduces a novel model for identifying essential proteins, termed MGCN-EP, which leverages graph convolutional neural networks and multilayer heterogeneous networks. First, the model computes a gene ontology-gene ontology similarity network and a protein complex-protein complex similarity network based on gene ontology semantic similarity and Gaussian interaction profile similarity. Then, associations between the PPI network and the other two networks are built through the subunits of protein complexes and the gene ontology annotations of proteins. Finally, the multilayer heterogeneous network is fed into a deep learning model based on graph convolutional neural networks to optimize node classification. To verify the model’s effectiveness, it is compared with self-learning and traditional centrality methods on different datasets. The experimental results show that MGCN-EP has the best overall performance on the Krogan yeast dataset, with an accuracy of 85.93% and a precision of 79.36%. |
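
The subcellular localization step described for MBI-EP in contribution (1) ranks localizations by protein coverage, keeps the top n, and maps them to node features. It can be illustrated with a minimal Python sketch. The function name `top_n_localization_features`, the toy proteins, the binary indicator encoding, and the alphabetical tie-breaking are all assumptions made for illustration; the abstract does not specify how the selected partitions are encoded.

```python
from collections import Counter


def top_n_localization_features(protein_locs, n):
    """Rank compartments by protein coverage and build per-protein features.

    protein_locs: dict mapping a protein ID to the set of subcellular
                  compartments it is annotated with (hypothetical input format).
    n:            number of highest-coverage compartments to keep.
    Returns (selected compartments, dict of protein -> binary feature list).
    """
    # Coverage = number of proteins annotated with each compartment.
    coverage = Counter()
    for compartments in protein_locs.values():
        coverage.update(set(compartments))

    # Descending coverage; ties broken alphabetically (an assumption,
    # made only so the result is deterministic).
    ranked = sorted(coverage, key=lambda c: (-coverage[c], c))
    selected = ranked[:n]

    # Assumed encoding: 1 if the protein is annotated with the compartment.
    features = {
        protein: [1 if c in compartments else 0 for c in selected]
        for protein, compartments in protein_locs.items()
    }
    return selected, features


# Toy annotations, invented for illustration only.
toy = {
    "P1": {"nucleus", "cytoplasm"},
    "P2": {"nucleus"},
    "P3": {"mitochondrion", "nucleus"},
    "P4": {"cytoplasm"},
}
selected, features = top_n_localization_features(toy, n=2)
print(selected)
print(features["P1"], features["P4"])
```

In this toy input, "nucleus" covers three proteins and "cytoplasm" two, so with n = 2 the "mitochondrion" annotation is dropped and each protein receives a two-element feature vector.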