| Essential proteins are indispensable for organism survival and for maintaining basic cell and tissue functions. Using efficient computational methods to identify human essential proteins can accelerate the understanding of the principles of cellular life and the evolutionary nature of organisms, provide a theoretical basis for discovering novel drug targets, and promote the development of precision cancer medicine. With the advent and application of CRISPR-Cas9 gene editing technology in recent years, data on human gene and protein essentiality have grown rapidly, bringing unprecedented opportunities and challenges to human protein essentiality research. This paper aims to propose a more accurate, reliable and practical method for predicting human essential proteins from sequence information using deep learning, and designs separate deep learning models for feature extraction and prediction that target the genetic-dependent and environment-dependent properties of essentiality, in order to identify essential proteins more efficiently. The main work and contributions of this paper are summarized as follows. (1) Existing methods rely on intrinsic sequence features and traditional machine learning, which struggle to fully represent sequences and to extract their features effectively. To address this, this paper proposes EP-EDL, an ensemble deep learning-based essential protein prediction method that targets three key factors affecting model construction: sequence representation, feature extraction and class imbalance. To obtain a highly informative sequence representation and extract effective features, EP-EDL uses a position-specific scoring matrix, which embeds evolutionary information, to represent protein sequences, and mines latent sequence features for prediction with a multi-scale text convolutional neural network. To mitigate the impact of class imbalance, EP-EDL uses an 
under-sampling-based ensemble learning strategy to improve prediction accuracy and robustness. The results demonstrate that EP-EDL outperforms existing state-of-the-art sequence-based prediction methods, providing biologists with a more practical and accurate method for identifying essential proteins. (2) Because human protein essentiality varies across cellular environments, this paper proposes DeepCellEss, an interpretable, cell line-specific deep learning prediction method based on the attention mechanism. DeepCellEss performs model training and prediction on 323 cancer cell lines. The model uses convolutional neural networks and bidirectional long short-term memory networks to learn local and long-range dependencies between amino acids in sequences, and uses a multi-head self-attention mechanism to analyze amino acid-level contributions to essentiality, enabling accurate cell line-specific essential protein prediction and interpretation. The results show that the model structure of DeepCellEss is the optimal combination for the current prediction task, achieves better prediction results on independent test sets from multiple cell lines, outperforms all the latest sequence-based methods compared, and can be effectively applied to predicting essential proteins in different cell types. |
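
The under-sampling-based ensemble strategy described for EP-EDL can be illustrated with a minimal sketch. It assumes the common scheme in which each base model is trained on all minority-class (essential) samples plus an equally sized random draw from the majority class (non-essential), and the ensemble averages the base models' scores. The names `balanced_subsets`, `CentroidScorer`, `train_ensemble` and `predict` are hypothetical, the two-dimensional toy features are invented for illustration, and the centroid-based scorer is only a stand-in for the multi-scale convolutional network used in the paper.

```python
import random
from statistics import mean


def balanced_subsets(majority, minority, n_models, seed=0):
    """Yield (negatives, positives) pairs: every pair keeps all minority
    samples and adds an equally sized random draw from the majority class."""
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    for _ in range(n_models):
        yield rng.sample(majority, k), list(minority)


class CentroidScorer:
    """Hypothetical base learner (stand-in for the CNN in the paper):
    scores a sample by its relative distance to the two class centroids."""

    def fit(self, negatives, positives):
        self.neg = [mean(col) for col in zip(*negatives)]
        self.pos = [mean(col) for col in zip(*positives)]
        return self

    def score(self, x):
        d_neg = sum((a - b) ** 2 for a, b in zip(x, self.neg))
        d_pos = sum((a - b) ** 2 for a, b in zip(x, self.pos))
        total = d_neg + d_pos
        # Closer to the positive centroid gives a score nearer to 1.
        return 0.5 if total == 0 else d_neg / total


def train_ensemble(majority, minority, n_models=5, seed=0):
    """Train one base model per balanced subset."""
    return [
        CentroidScorer().fit(neg, pos)
        for neg, pos in balanced_subsets(majority, minority, n_models, seed)
    ]


def predict(ensemble, x):
    """Average the base models' scores (soft voting)."""
    return mean(model.score(x) for model in ensemble)


# Toy, invented data: 50 non-essential vs. 10 essential feature vectors.
rng = random.Random(1)
non_essential = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
essential = [[rng.gauss(1.0, 0.1), rng.gauss(1.0, 0.1)] for _ in range(10)]
ensemble = train_ensemble(non_essential, essential, n_models=5)
```

Calling `predict(ensemble, [1.0, 1.0])` returns a score above 0.5 on this toy data, and `predict(ensemble, [0.0, 0.0])` a score below it. Because every base model sees a balanced training set, no single model is dominated by the majority class, while averaging over several random draws reduces the information lost by discarding majority samples.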