
A Multi-label Classifier Based On PSSM And GO For Predicting Protein Subcellular Localization

Posted on: 2017-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: B J Liu
GTID: 2310330512970728
Subject: Computer software and theory
Abstract/Summary:
A cell is a highly structured unit. Its internal structure is complex and delicate, and a variety of subcellular structures can be distinguished by their function and spatial distribution. Proteins play a fundamental role in many biological processes, and normal cellular life requires each protein to be at its specific subcellular location in order to carry out its particular function. A growing number of proteins are known to reside at two or more subcellular locations simultaneously, so predicting protein subcellular locations can be viewed as a multi-label classification problem. According to related work, prediction performance depends largely on feature generation and on the prediction algorithm, and this thesis addresses both aspects.

First, the thesis proposes a robust feature generation method with three components. (1) It constructs feature vectors from information extracted from the Position Specific Scoring Matrix (PSSM) and from Gene Ontology (GO) annotations. (2) It takes evolutionary information into account and uses a text-mining method, the logarithmic transformation of the CHI-square statistic, to assign each GO term a weight that reflects how well that term discriminates between locations. (3) It applies the Hilbert-Schmidt Independence Criterion (HSIC) to reduce the dimensionality of the features, which removes redundant information and helps simplify the model built in the next step.

Second, although an effective feature generation method is a precondition for good prediction, the prediction algorithm also plays an important role. Because the task is multi-label, the prediction results are diverse and uncertain. The thesis therefore applies neighborhood rough sets, which are commonly used to handle fuzzy problems, together with information on the correlation among labels, to build a model for predicting protein subcellular locations. To improve the model's tolerance to noisy data, variable precision is introduced into the definitions of the upper and lower approximations of neighborhood rough set concepts. In addition, an analysis of the relevant biological processes shows that subcellular location labels are correlated, so these label correlations are incorporated into the prediction model.

The proposed method was evaluated on two benchmark datasets, Viral-proteins and Plant-proteins, using widely accepted evaluation metrics, and its predictions were compared with those of several existing methods. The experimental results show that the proposed method outperformed the existing methods it was compared with. Finally, the thesis summarizes the work and offers a number of recommendations for future work.
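
The HSIC-based dimensionality reduction mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's exact procedure, so everything below is an assumption made for illustration: the biased empirical HSIC estimator, trace(KHLH)/(n-1)^2, a Gaussian kernel on each feature, a linear kernel on the binary label matrix, and a simple per-feature ranking. The function names (`rbf_kernel`, `hsic`, `rank_features_by_hsic`) and the `sigma` parameter are hypothetical.

```python
import numpy as np


def rbf_kernel(x, sigma=1.0):
    """Gaussian kernel matrix for samples stored in the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a single feature as an (n, 1) matrix
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def hsic(K, L):
    """Biased empirical HSIC between two n x n kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2


def rank_features_by_hsic(X, Y, sigma=1.0):
    """Score each column of X by its HSIC dependence on the label matrix Y.

    X: (n_samples, n_features) feature matrix (e.g. PSSM- and GO-derived).
    Y: (n_samples, n_labels) binary multi-label indicator matrix.
    Returns feature indices sorted from most to least dependent, and the scores.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    L = Y @ Y.T  # linear kernel on labels (an assumption of this sketch)
    scores = np.array(
        [hsic(rbf_kernel(X[:, j], sigma), L) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores)
    return order, scores
```

Keeping the top-ranked columns, for example `X[:, order[:k]]` for a chosen `k`, gives a reduced feature matrix. Scoring features one at a time ignores redundancy between features, so a method that aims to remove redundant information, as the abstract describes, would likely need a subset-level criterion instead of this per-feature ranking.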
Keywords/Search Tags:Protein Subcellular Location, Position Specific Scoring Matrix, Gene Ontology, Reduce Feature Dimensions, Multi-Label Classification, Neighborhood Rough Sets