Predicting Protein Molecular Functions Based On Sequence Features

Posted on: 2007-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: R Bi
GTID: 2120360242461827
Subject: Computer applications
Abstract/Summary:
Gene Ontology (GO) is a common language for the functional annotation of gene products. In recent years, many GO-based computational methods for the functional annotation of gene products have been developed. In this work, a computational tool, GOKey, is developed to predict the GO functions of proteins from their sequence features using the support vector machine (SVM) method. The features used in GOKey include amino acid composition, hydrophobicity, normalized van der Waals volume, polarity, polarizability and charge. Several measures have been adopted to improve the prediction performance of GOKey, including improved handling of the imbalance between positive and negative training data, and postprocessing strategies that evaluate the posterior probability and statistical significance of SVM outputs.

The results of 5-fold cross-validation on 10,603 GO-mapped proteins demonstrate that GOKey performs better than a standard SVM. Comparisons with other computational tools for GO function prediction also show that the performance of GOKey is satisfactory. Further, GOKey has been applied to predict GO functions for 5,381 novel human proteins in the Ensembl database. The results show that 93% of the novel proteins can be assigned one or more GO terms, and some evidence supporting the predictions has been found. GOKey can be accessed via http://infosci.hust.edu.cn.

In addition, an improved hexamer usage preference model is presented. By considering both the usage frequency and the C+G content of each hexamer, it effectively reduces the dependence of coding potentials on C+G content. Compared with some widely used coding measures, the proposed method needs less training data while performing better in the recognition of protein-coding regions.
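
As an illustration of the simplest sequence feature named in the abstract, amino acid composition, the following is a minimal sketch that turns a protein sequence into a 20-dimensional vector of residue fractions. It is not GOKey's code: the function name `amino_acid_composition`, the residue ordering, and the choice to ignore non-standard residue codes are all assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue, in AMINO_ACIDS order.

    Characters outside the 20 standard codes (e.g. X, B, Z) are ignored,
    which is an assumption of this sketch, not a documented GOKey rule.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call.
    vector = amino_acid_composition("MKVLAAGIVG")
    print(len(vector), round(sum(vector), 6))
```

A vector like this could serve as one block of the input to an SVM classifier; the other properties listed in the abstract (hydrophobicity, polarity, and so on) would need their own encodings, which the abstract does not specify.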
Keywords/Search Tags: Protein function, Gene Ontology, Support Vector Machines, Statistical significance, Hexamer usage preference