Application Of Class Imbalance Learning In Protein Subcellular Localization

Posted on:2022-07-09

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L W Wu

Full Text:PDF

GTID:1520306335995089

Subject:Information and Communication Engineering

Abstract/Summary:


Protein subcellular localization studies the correspondence between proteins and the subcellular compartments in which they reside. Knowing a protein's subcellular location helps to annotate its structure and function, explore the pathogenesis of diseases, and develop drugs for clinical treatment. In the early stages of research, subcellular locations were annotated mainly through biological experiments, but such conventional experiments are very costly. Machine-learning-based methods for protein subcellular localization were therefore developed. These methods predict the subcellular location of a query protein by learning its feature distribution, and they improve accuracy by minimizing the error of the predicted location.

Because of the characteristics of protein data and environmental influences, the distribution of proteins annotated with subcellular locations is generally imbalanced: the number of proteins belonging to some subcellular locations is far smaller than the number belonging to others. This imbalance seriously degrades a model's ability to predict subcellular localization, yet few studies have addressed it. Among those that have, the oversampling method SMOTE is the most commonly used remedy. SMOTE nevertheless has unsolved problems, such as sample stacking, lack of diversity, and the introduction of noise. To address these problems, this thesis takes oversampling techniques and generative adversarial networks as its theoretical basis and studies protein subcellular localization in light of the characteristics of protein data. The specific research content includes the following aspects:

(1) This thesis proposes a new imbalanced learning method named Radius-SMOTE for single-location protein subcellular localization. Radius-SMOTE generates minority-class samples through nonlinear interpolation, approaching the problem by changing the sample value space, which effectively mitigates sample stacking. The thesis further proposes a protein subcellular localization model named R-PLoc, based on Radius-SMOTE, and verifies its performance on two benchmark datasets.

(2) For single-location protein subcellular localization, the minority protein samples generated by conventional imbalanced learning methods are susceptible to noise and low diversity. This thesis therefore proposes an oversampling method, IRadius-SMOTE, based on K-nearest neighbors and feature space constraints. The method inserts new samples uniformly and conditionally in the feature space to correct the data imbalance, which effectively avoids the impact of noise and improves the diversity of the generated samples. The thesis further proposes a protein subcellular localization model, IR-PLoc, based on IRadius-SMOTE. Experiments show that this model handles imbalanced protein data better.

(3) In multi-label protein subcellular localization prediction, protein data imbalance leads to data stacking among the original sample classes. To solve this problem, this thesis proposes a new imbalanced learning method named SM-GAN, which consists of two sub-models: a generator and a discriminator. Through a zero-sum game with the discriminator, the generator is trained to fit the distribution of the minority class. It can then generate new minority-class samples from the learned distribution, so no direct constraint on the distance between samples is needed. In addition, this thesis proposes a new multi-label classifier named ML-DeepFM, based on DeepFM, which converts the multi-label classification problem into a comparison of scores across categories. Finally, Gm-PLoc is proposed by combining SM-GAN and ML-DeepFM, and this model can effectively classify the subcellular locations of multi-label proteins.
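
For orientation, the baseline that the abstract contrasts against can be sketched in a few lines. The Python sketch below implements only standard SMOTE (linear interpolation between a minority sample and one of its k nearest minority neighbors); it is not the thesis's Radius-SMOTE, IRadius-SMOTE, or SM-GAN, whose details are not given in this abstract. The function name `smote_oversample` and its parameters are illustrative assumptions.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Sketch of standard SMOTE: create n_new synthetic minority samples.

    X_min: array of shape (n_samples, n_features), minority class only.
    The name and signature are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Linear interpolation: the new point lies on the segment base -> neighbor.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Because every synthetic point lies on a straight segment between two existing minority samples, repeated sampling can pile new points onto the same few segments, which is one plausible reading of the "sample stacking" and "lack of diversity" problems mentioned above.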

Keywords/Search Tags:

Protein subcellular localization, Class imbalance learning, SMOTE, Generative adversarial networks, Multi-label classification


Related items

1	Research On Protein Subcellular Location Classification Based On Feature Learning
2	Method Development For Predicting Protein Subcellular Localization Based On Deep Learning
3	A Method And Its Application Research For Protein Subcellular Localization Prediction Based On Multi-label Learning
4	Prediction Of Protein Subcellular Localization By Using Machine Learning Method And Its Application
5	Predicting Multi-label Protein Subcellular Location Based On Deep Learning
6	Research On Protein Subcellular Localization Prediction Under Multi-label Setting
7	Using Multi-label Learning Methods To Study Protein Subcellular Localization Prediction
8	A Multi-label Classifier Based On PSSM And GO For Predicting Protein Subcellular Localization
9	Research On Prediction Of Sequence-based Multilocus Subcellular Localization
10	Protein Subcellular Localization Prediction From Multi-label Learning