| As a branch of proteomics research, protein subcellular location prediction plays an important role in many areas, such as exploring the specific functions of proteins and the mechanisms of protein interaction. In this thesis we focus on predicting multi-label protein subcellular location with deep learning. The main work is as follows. First, we extract 3622 immunohistochemical (IHC) images from the Human Protein Atlas database according to certain rules and separate the protein channel from each IHC image. We extract features with the LBP method and then transform the multi-label problem into a multi-class problem. Next, we use SVM, KNN and XGBoost for prediction; to improve prediction, we generalize the binary focal loss to a multi-class focal loss and use it as the loss function of XGBoost. SVM reaches the highest accuracy, 77.4%, but its Macro-F1 is 0. XGBoost with focal loss performs well, with a Macro-F1 of 12.48% and an accuracy of 60.4%. The simplest method, KNN, performs best, with an accuracy of 69.4% and a Macro-F1 of 16.54%. Analysis of the results suggests that prediction ability may be limited by the imbalanced distribution within each label and by label co-occurrence. Second, we train ResNet-18 on the training set with BCELoss. During training, we select as the final model the one with the largest metric value, where the metric is Macro-Recall multiplied by the minimum per-label Recall, computed only when the accuracy on the validation set is greater than 60%. Its accuracy on the test set is 59.4%. We also propose a new loss function to address within-label imbalance and label co-occurrence. A network with the newly proposed loss function is trained and optimized under the same conditions; its accuracy on the test set is 58.7%, almost the same as that of the model trained with BCELoss. However, its Macro-Recall and Macro-F1 increase by 9.3% and 4.7% respectively, and the Recall for the cytoplasmic and membranous labels increases by 16% and 24% respectively. We therefore believe that the newly proposed loss function significantly improves multi-label protein prediction relative to BCELoss. Finally, we summarize the work of this thesis and give an outlook on follow-up research. |
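
The model-selection rule described above (Macro-Recall multiplied by the minimum per-label Recall, considered only when validation accuracy exceeds 60%) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, "accuracy" is assumed here to mean exact-match (subset) accuracy over all labels, and labels with no positive samples are assumed to be skipped when averaging.

```python
import numpy as np


def per_label_recall(y_true, y_pred):
    """Recall for each label column; labels without positives yield NaN."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    true_pos = (y_true & y_pred).sum(axis=0).astype(float)
    positives = y_true.sum(axis=0).astype(float)
    return np.divide(true_pos, positives,
                     out=np.full_like(true_pos, np.nan),
                     where=positives > 0)


def selection_metric(y_true, y_pred, min_accuracy=0.60):
    """Macro-Recall times minimum per-label Recall.

    Returns None when the accuracy condition is not met, so that the
    checkpoint is not considered for selection.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # Assumption: accuracy is the fraction of samples whose full label
    # vector is predicted exactly (subset accuracy).
    accuracy = (y_true == y_pred).all(axis=1).mean()
    if accuracy <= min_accuracy:
        return None
    recall = per_label_recall(y_true, y_pred)
    return float(np.nanmean(recall) * np.nanmin(recall))


# Toy validation set: 4 samples, 2 labels (binary indicator matrix).
y_val = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_hat = [[1, 0], [1, 1], [0, 1], [0, 0]]
score = selection_metric(y_val, y_hat)
```

Multiplying by the minimum per-label Recall penalizes checkpoints that ignore a rare label entirely, which plain accuracy would not reveal; in a training loop one would keep the checkpoint with the largest non-None score.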