Protein subcellular localization provides valuable information for revealing pathogenic mechanisms, treating disease, and designing drugs. With the rapid accumulation of protein sequences in biological databases, subcellular localization methods based on machine learning and deep learning have become a research hotspot in recent years. However, the high complexity of protein sequences makes it difficult for some existing methods to achieve ideal prediction performance. To address the unequal lengths and uneven feature distribution of protein sequences, this article investigates an improved subcellular localization prediction method based on a deep Temporal Convolutional Network (TCN) that fully extracts localization information from protein sequences to obtain better prediction performance.

First, to fully extract features at different levels of protein sequences, this article proposes a protein subcellular localization prediction method based on a Multi-layer Feature Fusion TCN (MFTCN). A TCN is used to mine the localization information of a protein sequence, and within-layer attention then summarizes the information inside each layer to obtain features at different levels. To fully mine the information in the different layers, the features of all layers are then fused using multi-head hierarchical attention. Experimental results on the DeepLoc and UniLoc datasets show that the MFTCN-based method better captures the variability of sequences across different subcellular locations, resulting in better localization prediction performance.

Second, to address the small size and extreme class imbalance of protein sequence datasets annotated with subcellular localization labels, this article proposes a protein subcellular localization prediction method based on Data Augmentation MFTCN (Aug-MFTCN). A protein sequence generation method based on a pseudo-mutation strategy and a variational autoencoder is designed to expand the training set in a balanced way, and a sequence evaluation method is constructed to filter the generated sequences. Borrowing the idea of adversarial learning, the method also uses gradient information from the training process to construct more targeted adversarial samples for online data augmentation, which improves the generalization of the model. Experimental results on the DeepLoc dataset show that the Aug-MFTCN-based method greatly improves prediction performance on subcellular locations with few samples and generalizes better than the MFTCN method on the generalization test dataset.

Finally, to address the problem that existing methods do not consider prior information about protein sequences, this article proposes a protein subcellular localization prediction method based on Knowledge Integration MFTCN (KI-MFTCN). A Protein-BERT model is constructed to learn from a large number of unlabeled protein sequences and obtain feature vectors that fully characterize protein information. Prior sorting signal information in the protein sequence is then used to construct an auxiliary learning task that guides the model to focus on the amino acid peptide chains that play an important role in protein localization. Experiments on the DeepLoc and UniLoc datasets show that both the information mined from large amounts of unlabeled protein sequence data and the prior sorting signal information effectively help improve protein subcellular localization prediction performance.
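
As an illustration of the balanced training-set expansion described for Aug-MFTCN, the following is a minimal sketch of one way a pseudo-mutation step could work. It rests on assumptions rather than on details given in this abstract: the residue groups in `SIMILAR_GROUPS`, the per-residue mutation rate, and the function names `pseudo_mutate` and `balance_by_mutation` are all hypothetical, and the variational autoencoder generator and the sequence evaluation filter are omitted.

```python
import random
from collections import Counter

# Hypothetical groups of physicochemically similar amino acids (an assumption,
# not taken from the source). Residues outside these groups are never mutated.
SIMILAR_GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH"]
SUBSTITUTES = {aa: group.replace(aa, "") for group in SIMILAR_GROUPS for aa in group}


def pseudo_mutate(seq, rate, rng):
    """Swap each residue, with probability `rate`, for a similar residue."""
    mutated = []
    for aa in seq:
        candidates = SUBSTITUTES.get(aa, "")
        if candidates and rng.random() < rate:
            mutated.append(rng.choice(candidates))
        else:
            mutated.append(aa)
    return "".join(mutated)


def balance_by_mutation(samples, rate=0.05, seed=0):
    """Add mutated copies of minority-class sequences until every class
    has as many samples as the largest class.

    `samples` is a list of (sequence, location_label) pairs.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())

    # Group the original sequences by label so mutations start from real data.
    by_label = {}
    for seq, label in samples:
        by_label.setdefault(label, []).append(seq)

    augmented = list(samples)
    for label, seqs in by_label.items():
        for _ in range(target - counts[label]):
            parent = rng.choice(seqs)
            augmented.append((pseudo_mutate(parent, rate, rng), label))
    return augmented


if __name__ == "__main__":
    # Toy data: three "nucleus" sequences and one "lysosome" sequence.
    toy = [("MKTAYIAKQR", "nucleus")] * 3 + [("MALWMRLLPL", "lysosome")]
    balanced = balance_by_mutation(toy)
    print(Counter(label for _, label in balanced))
```

Substituting only within groups of similar residues keeps each generated sequence the same length as its parent and close to it in composition. That is one plausible reading of "pseudo-mutation"; the actual strategy may differ.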