Predicting protein function using sequence derived features selected by genetic algorithms

Posted on: 2009-10-05
Degree: Ph.D.
Type: Dissertation
University: Columbia University
Candidate: Kernytsky, Andrew
GTID: 1440390002494388
Subject: Chemistry
Abstract/Summary:
Large-scale sequencing of genomes has created a peculiar problem for biology: there is now a glut of information in the form of nucleotide sequences, but deciphering the higher-level annotations buried in those sequences remains, to varying extents, an unsolved problem. One aspect of this problem is deducing the function of a protein from its sequence. This is an important challenge because the number of raw protein sequences far surpasses the number of well-characterized proteins. We address this problem by using computational methods to predict protein function from sequence-derived features.

We describe a method for predicting protein function, in terms of enzymatic activity classification (Enzyme Commission numbers), using only the protein sequence. The method begins by generating sequence-derived features for a protein, ranging from the amino acid composition to predicted features such as secondary structure and solvent accessibility. To capture the local environment surrounding a key residue (a residue involved in catalysis, for example), the method searches for combinations of these features that have predictive power when they occur at the same residue. For example, the learning algorithm may find that a particular amino acid residue is a good indicator of some protein function when another sequence-derived feature indicates that the residue is predicted to be at the surface of the protein or in a beta-sheet secondary structure element.

These predictive combinations of features are detected by a genetic algorithm used as a wrapper around a neural network. By incorporating features of the environment surrounding a single residue, the method may be seen as a specialized motif detector that finds instances of these combined features that are correlated with protein function. We evaluate the performance of this method across 59 enzymatic activity classes and find that the genetic-algorithm-based selection of feature combinations significantly increases the predictive power of the method.
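
The search for feature combinations that co-occur at a single residue can be illustrated with a small sketch. The code below is not the dissertation's implementation: the feature names, the synthetic data, and the genetic algorithm settings are all hypothetical, and a simple balanced-accuracy score is swapped in where the abstract describes a neural network. It only shows the wrapper idea, in which a genetic algorithm proposes a combination of per-residue features and a scoring step reports how well the presence of that combination separates two classes of proteins.

```python
import random

# Hypothetical per-residue binary features; the names are illustrative only.
FEATURES = ["is_His", "is_Ser", "is_Gly", "pred_exposed", "pred_strand", "pred_helix"]

def has_cooccurrence(protein, mask):
    """True if at least one residue has every mask-selected feature switched on."""
    chosen = [i for i, bit in enumerate(mask) if bit]
    return bool(chosen) and any(all(res[i] for i in chosen) for res in protein)

def fitness(mask, proteins, labels):
    """Balanced accuracy of the rule 'combination present => positive class'.

    Assumption: this cheap score stands in for training and validating a
    neural network on the selected features.
    """
    pos = [has_cooccurrence(p, mask) for p, y in zip(proteins, labels) if y]
    neg = [has_cooccurrence(p, mask) for p, y in zip(proteins, labels) if not y]
    return (sum(pos) / len(pos) + 1 - sum(neg) / len(neg)) / 2

def evolve(proteins, labels, n_features, pop_size=30, generations=40, p_mut=0.1, seed=0):
    """Return (best_mask, history); history holds the best score per generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    score = lambda mask: fitness(mask, proteins, labels)
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        parents = pop[: pop_size // 2]  # truncation selection; the best always survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])  # bit flips
        pop = parents + children
    return max(pop, key=score), history

def make_toy_data(n=40, length=30, seed=1):
    """Synthetic proteins: only positives have His and exposure at the same residue."""
    rng = random.Random(seed)
    proteins, labels = [], []
    for k in range(n):
        protein = []
        for _ in range(length):
            res = [0] * len(FEATURES)
            aa, ss = rng.randrange(4), rng.randrange(3)
            if aa < 3:
                res[aa] = 1          # His, Ser or Gly; 3 means "other"
            if ss < 2:
                res[4 + ss] = 1      # strand or helix; 2 means "coil"
            res[3] = 0 if res[0] else rng.randint(0, 1)  # His is never exposed by chance
            protein.append(res)
        positive = k % 2 == 0
        if positive:
            site = protein[rng.randrange(length)]
            site[0], site[1], site[2], site[3] = 1, 0, 0, 1  # planted exposed His
        proteins.append(protein)
        labels.append(positive)
    return proteins, labels

proteins, labels = make_toy_data()
best, history = evolve(proteins, labels, len(FEATURES))
print([name for name, bit in zip(FEATURES, best) if bit], fitness(best, proteins, labels))
```

In this toy data the planted signal is a histidine that is also predicted to be exposed, so neither feature is informative alone and only their co-occurrence at one residue separates the classes. The search is stochastic, so whether it recovers exactly that pair depends on the seed and settings. Because the best individual is always carried into the next generation, the best score in `history` cannot decrease.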
Keywords/Search Tags: Protein function, Sequence derived features, Genetic, Algorithm, Method, Using, Problem