Protein subcellular localization studies the correspondence between proteins and the subcellular compartments in which they reside. Knowing a protein's subcellular location helps to annotate its structure and function, explore the pathogenesis of diseases, and develop drugs for clinical treatment. In the early stages of research, subcellular locations were annotated mainly through biological experiments, but such conventional experiments are very costly. Machine learning based protein subcellular localization methods were therefore developed. These methods predict the subcellular location of a query protein by learning its feature distribution, and they improve accuracy by minimizing the localization error.

Owing to the characteristics of protein data and environmental influences, proteins annotated with subcellular locations are generally imbalanced across classes: some locations contain far fewer proteins than others. This imbalance seriously degrades the performance of models that predict subcellular localization, yet few studies have addressed it. Most of those that do rely on the oversampling method SMOTE, which leaves several problems unsolved, such as sample stacking, lack of diversity, and the inclusion of noise. To address these problems, this thesis takes oversampling techniques and generative adversarial networks as its theoretical basis and studies protein subcellular localization in combination with the data characteristics of proteins. The specific research content of this thesis includes the following aspects:

(1) This thesis proposes a new imbalanced learning method named Radius-SMOTE for single-location protein subcellular localization. Radius-SMOTE generates minority-class samples through nonlinear interpolation, approaching the problem by changing the sample value space, which effectively mitigates sample stacking. This thesis further proposes a protein subcellular localization model named R-PLoc based on Radius-SMOTE and verifies its performance on two benchmark datasets.

(2) For single-location protein subcellular localization, the minority protein samples generated by conventional imbalanced learning methods are susceptible to noise and have low diversity. This thesis therefore proposes an oversampling method, IRadius-SMOTE, based on K-nearest neighbors and feature space constraints. The method inserts new samples into the feature space uniformly and conditionally to resolve the data imbalance, which effectively avoids the impact of noise and improves the diversity of the generated samples. This thesis further proposes a protein subcellular localization model, IR-PLoc, based on IRadius-SMOTE. Experiments show that this model handles imbalanced protein data better.

(3) For multi-label protein subcellular localization, protein data imbalance leads to data stacking among the original sample classes. To solve this problem, this thesis proposes a new imbalanced learning method named SM-GAN, which consists of two sub-models: a generator and a discriminator. Through a zero-sum game with the discriminator, the generator learns to fit the distribution of the minority class and can then generate new minority-class samples, so the direct constraint on the distance between samples is discarded. In addition, this thesis proposes a new multi-label classifier named ML-Deep FM, based on Deep FM, which converts the multi-label classification problem into a comparison of scores between categories. Finally, this thesis proposes Gm-PLoc, which combines SM-GAN and ML-Deep FM and can effectively classify the subcellular locations of multi-label proteins.
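To make the sample stacking problem concrete, the following is a minimal sketch, not the implementation used in this thesis. `smote_linear` follows the standard SMOTE idea of interpolating linearly between a minority sample and one of its nearest minority neighbors, so every synthetic point falls on a line segment between two existing samples. `radius_oversample` is a hypothetical variant, assumed here purely for illustration, that draws each synthetic point from a ball around a minority sample; it should not be read as the definition of Radius-SMOTE or IRadius-SMOTE. All function names, parameters, and the toy data are assumptions.

```python
import numpy as np


def _pairwise_distances(points):
    """Euclidean distances between all rows; the diagonal is set to inf."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    return dist


def smote_linear(minority, n_new, k=5, rng=None):
    """Standard SMOTE-style linear interpolation between minority neighbors."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Indices of the k nearest minority neighbors of every minority sample.
    neighbours = np.argsort(_pairwise_distances(minority), axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    # Synthetic points lie on the segment between base and picked samples.
    return minority[base] + lam * (minority[picked] - minority[base])


def radius_oversample(minority, n_new, rng=None):
    """Hypothetical radius-based variant (an assumption, not the thesis method).

    Each synthetic point is drawn uniformly from a ball centered on a
    minority sample whose radius is the distance to its nearest minority
    neighbor, so points are not confined to line segments.
    """
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n, d = minority.shape
    radius = _pairwise_distances(minority).min(axis=1)
    base = rng.integers(0, n, size=n_new)
    direction = rng.normal(size=(n_new, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    # u ** (1 / d) makes the draw uniform in volume inside the ball.
    scale = radius[base, None] * rng.random((n_new, 1)) ** (1.0 / d)
    return minority[base] + scale * direction


# Toy minority class: four points on the unit square (illustrative only).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic_line = smote_linear(minority, n_new=6, k=2, rng=0)
synthetic_ball = radius_oversample(minority, n_new=6, rng=0)
print(synthetic_line.shape, synthetic_ball.shape)  # (6, 2) (6, 2)
```

Because linear interpolation only produces convex combinations of two existing samples, repeated oversampling concentrates synthetic points along the same segments, which is one way to picture the stacking and low diversity described above. Sampling from a region around each point spreads the synthetic samples, at the cost of needing some additional constraint to avoid placing them in noisy or majority-class areas.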