
Predicting Subnuclear Location of Proteins and Subcellular Location of ncRNAs Based on Multi-Information Fusion and Multi-Label Ensemble Classifier

Posted on: 2014-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Peng
Full Text: PDF
GTID: 2250330398996534
Subject: Physics
Abstract/Summary:
The subcellular localization of proteins and non-coding RNAs (ncRNAs) is crucial to understanding biomacromolecular interaction and function, as well as to drug research and development. Determining it experimentally, however, is time-consuming and expensive, so an efficient classification system is highly desirable. In this thesis, subcellular localization prediction for proteins and non-coding RNAs is discussed in detail in eight steps: benchmark dataset construction, feature representation, feature selection, classification algorithms, validation methods, evaluation indexes, case study, and result analysis.
First, datasets for eukaryotic protein subnuclear localization and archaeal protein subcellular localization were constructed, and four kinds of feature information were extracted: sequence information, evolutionary information, gene ontology (GO) annotation, and protein post-translational modification. The sequence information consists of split amino acid compositions and tripeptide compositions filtered using the binomial distribution. The evolutionary information consists of filtered sequence profiles, obtained by computing the sequence profiles of conserved sites in the protein sequence from the Position Specific Scoring Matrix. For the GO annotation, a protein sequence is represented by the GO terms of its homologous proteins; the GO terms are then filtered by Shannon information, and the filtered GO terms of a protein are converted into a logical vector. The post-translational modification features are the numbers of phosphothreonine, phosphoserine, phosphotyrosine, acetylation, and methylation sites. The four kinds of features were then fed into binary classifier systems based on KNN and SVM, respectively, and the one or multiple locations of each protein were determined by a majority voting procedure. Finally, the classification systems achieved better results than other methods in leave-one-out cross-validation.
For non-coding RNAs, the prediction of subcellular localization is investigated for the first time in this thesis. A subcellular localization dataset of ncRNAs was first constructed. Then, based on the fusion of three kinds of k-mer information, the subcellular localizations were predicted by an ensemble classifier of KNN and SVM. An overall accuracy of 86.75% was achieved.
At the end of the thesis, the three studies described above are summarized, and some further ideas for predicting the subcellular localization of proteins and ncRNAs are proposed.
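The k-mer fusion and majority voting steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not say which three k-mer sizes were fused or how ties and empty votes were handled, so the choice of k = 1, 2, 3, the RNA alphabet "ACGU", the strict-majority rule, the fallback to the best-supported location, and all function names are assumptions made for illustration only.

```python
from itertools import product


def kmer_composition(seq, k, alphabet="ACGU"):
    """Normalized k-mer frequencies of seq, in fixed lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing symbols outside the alphabet
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def fused_features(seq, ks=(1, 2, 3)):
    """Concatenate the k-mer compositions for several k (assumed: k = 1, 2, 3)."""
    features = []
    for k in ks:
        features.extend(kmer_composition(seq, k))
    return features


def vote_locations(votes_per_location):
    """Pick every location that a strict majority of binary classifiers accept.

    votes_per_location maps a location name to a non-empty list of 0/1 votes,
    one vote per binary classifier (for example, KNN and SVM variants).
    Assumed fallback: if no location wins a majority, return the single
    location with the highest share of positive votes.
    """
    chosen = [loc for loc, votes in votes_per_location.items()
              if 2 * sum(votes) > len(votes)]
    if not chosen:
        best = max(votes_per_location,
                   key=lambda loc: sum(votes_per_location[loc]) / len(votes_per_location[loc]))
        chosen = [best]
    return chosen
```

With these assumptions, `fused_features` yields a vector of 4 + 16 + 64 = 84 values for an RNA sequence, and `vote_locations` can return more than one location, which is what makes the prediction multi-label.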
Keywords/Search Tags: protein subnuclear location, non-coding RNA, subcellular localization, multi-features fusion, feature selection, SVM-KNN multi-label ensemble classifier