
Research On Protein Subcellular Localization Prediction Based On Multi-label Learning

Posted on: 2022-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Yu
Full Text: PDF
GTID: 2504306551982299
Subject: Computer Science and Technology
Abstract/Summary:
Proteins play an important role in the life of the cell. Determining the subcellular localization of proteins is important for exploring protein function and designing drugs. With the explosive growth of protein sequence data in the post-genomic era, traditional experimental methods cannot meet the demand for subcellular localization of a huge number of proteins. To address this problem, researchers have turned to machine learning. Traditional subcellular localization prediction is mainly aimed at single-site proteins. In fact, a large number of proteins may simultaneously exist at, or move between, two or more different subcellular locations. Therefore, the prediction of multi-site proteins has more practical significance.

1. We propose a multi-site subcellular localization prediction model for human proteins. A GO annotation method based on Bayesian statistics is used for feature extraction, and the extracted feature vectors are input into a Classifier Chain (CC) based on the k-Nearest Neighbors (kNN) algorithm to predict the possible subcellular locations of protein sequences. If the GO features of a protein sequence cannot be obtained, Chou’s pseudo amino acid composition (PseAAC) method and the N-terminal signal method are combined for feature extraction, and redundant and invalid features are then removed by the Multi-Label ReliefF (ML-ReliefF) method. Finally, the selected feature subset is input into the CC classifier to predict the remaining protein sequences. The jackknife method was used for cross-validation, and the results show that the method is feasible.

2. Based on the Multi-Label Synthetic Minority Over-Sampling Technique (ML-SMOTE) algorithm, an improved oversampling algorithm for balancing multi-label datasets, PML-SMOTE, is proposed to preprocess imbalanced protein sequence datasets. The algorithm balances the dataset by generating new samples, which improves the prediction performance of the model.

3. A prediction model for protein subcellular localization in Gram-positive and Gram-negative bacteria based on the PSSM-MLSMOTE method is proposed. First, the consensus sequence-based occurrence (AAO) and evolutionary-based occurrence (PSSM-AAO) methods are used to extract features of protein sequences, and the two feature sets are fused. Then the PML-SMOTE method is used to preprocess the dataset. Finally, the classification prediction model is obtained using the Multi-Label k-Nearest Neighbor (ML-kNN) algorithm. Jackknife cross-validation shows that the proposed PSSM-MLSMOTE model can effectively predict the subcellular locations of multi-site proteins.
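
The classifier chain with kNN base learners and the jackknife test described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`knn_predict`, `chain_predict`, `jackknife_exact_match`), the choice of k = 3, the exact-match score, and the toy two-dimensional features standing in for GO or PseAAC feature vectors are all assumptions made for illustration.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote over the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def chain_predict(train_X, train_Y, x, k=3):
    """Classifier chain: label j is predicted from the features plus labels 0..j-1."""
    predicted = []
    for j in range(len(train_Y[0])):
        # Training rows are extended with the true earlier labels;
        # the query is extended with the labels predicted so far.
        aug_train = [list(fx) + list(fy[:j]) for fx, fy in zip(train_X, train_Y)]
        aug_x = list(x) + predicted
        column_j = [fy[j] for fy in train_Y]
        predicted.append(knn_predict(aug_train, column_j, aug_x, k))
    return predicted

def jackknife_exact_match(X, Y, k=3):
    """Jackknife (leave-one-out) test: hold out each sample once, train on the rest."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_Y = Y[:i] + Y[i + 1:]
        if chain_predict(rest_X, rest_Y, X[i], k) == list(Y[i]):
            hits += 1
    return hits / len(X)

# Hypothetical toy data: three well-separated clusters of "proteins",
# two labels per sample (location A, location B); the third cluster is multi-site.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
     [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
     [0.0, 1.0], [0.1, 1.0], [0.0, 1.1], [0.1, 1.1]]
Y = [[1, 0]] * 4 + [[0, 1]] * 4 + [[1, 1]] * 4

print(chain_predict(X, Y, [0.05, 1.05]))
print(jackknife_exact_match(X, Y))
```

The chain feeds each predicted label forward as an extra feature, which is how a classifier chain can exploit correlations between subcellular locations. The order of labels in the chain is a design choice that the abstract does not specify.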
Keywords/Search Tags: Protein Subcellular Localization, Position Specific Scoring Matrix, Gene Ontology, Feature Fusion