Search and analysis of the sequence space of a protein using computational tools

Posted on: 2007-07-08
Degree: Ph.D.
Type: Thesis
University: Georgia Institute of Technology
Candidate: Dubey, Anshul
Full Text: PDF
GTID: 2450390005986771
Subject: Chemistry
Abstract/Summary:
The application of enzymes as catalysts for industrial processes spawned what is now the rapidly growing field of biocatalysis. Numerous enzymes have been found and characterized according to their functions and/or three-dimensional structures. Directed Evolution (DE) is a field of research in biocatalysis in which mutations are made in the sequence of a native, or wild-type, enzyme. These mutations are made at random, with the purpose of finding an alternate sequence, or variant, that improves on the wild-type enzyme for a specific property. The sequence space of an enzyme refers to all the possible variants that can be created from it. Because the number of such sequences is immensely large, making mutations at random is not an optimal strategy. However, in the absence of a proper understanding of how the sequence of a protein translates into its function, that is, a sequence-to-function map, DE is usually the only available option.

In this thesis, a computational approach to improving the process of DE is presented, which involves using machine learning algorithms. When an enzyme is subjected to DE, a large number of its variants are created and analyzed through high-throughput screening methods. The screening results provide a measure of the property of interest, such as catalytic activity towards a specific reaction, for each of these variants. These data can be used to search for patterns in the sequence space, which can lead to an understanding of how function is related to an enzyme's sequence. However, the critical limitation of this approach is the scarcity of data: sequencing the created variants is a sizeable task, and only a relatively small number of sequences can be expected to be available. Most machine learning methods, on the other hand, usually require a large number of examples in the data set. To circumvent this constraint, a simplifying assumption was made whereby the variants were divided into two classes, positive and negative. The criterion for this division can be selected based on the measured property of interest for the variant, according to the screening method. Such an assumption reduces the problem to a case of non-linear classification into binary classes. Efficient algorithms have been developed for such problems, which may be able to give pertinent results from the available data.

Chapters 1 and 2 of this thesis introduce the basic concepts of protein engineering and Directed Evolution. The suggested approach of using machine learning to analyze the sequence space of any protein or enzyme is described. Chapter 2 also provides background information on the different experimental procedures that are performed during a DE process. It also reviews research on applying different computational strategies to improve DE. Background is provided on the different machine learning algorithms that were used with the data available from DE.

Support Vector Machines (SVMs) are a recently developed class of algorithms that are primarily used for non-linear classification. Chapter 3 describes an SVM formulated to identify important amino acids in the sequence of a protein. An important amino acid residue was defined as one that, if mutated, results in an inactive variant. The data used were the variant sequences containing random mutations created by using different protocols of DE. Based on the screening results, the variants were classified as positive if they had any catalytic activity, or negative if they had none. This algorithm was applied to the TEM-1 beta-lactamase sequence. The reason for this choice was the availability of known significant amino acid residues for the TEM-1 beta-lactamase sequence, which were found through extensive experiments. In silico, or computer-generated, variants were created by simulating the different protocols of DE. It was shown that the SVM can efficiently identify such...
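
As an illustration only, the data preparation implied by the abstract (generating in silico variants, labeling each variant positive or negative, and turning sequences into numeric features for a binary classifier) might be sketched as follows. Everything here is an assumption for the sake of the example and is not taken from the thesis: the uniform point-substitution model stands in for the simulated DE protocols, the one-hot encoding is one common choice of feature representation, and the toy wild-type sequence, activity values, and function names are hypothetical. No SVM is trained in this sketch.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(wild_type, n_mutations, rng):
    """Return a variant with n_mutations random point substitutions.

    Assumption: positions and replacement residues are drawn uniformly,
    a simplification of any real mutagenesis protocol.
    """
    residues = list(wild_type)
    for pos in rng.sample(range(len(residues)), n_mutations):
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def label(activity, threshold=0.0):
    """Return +1 if the measured activity exceeds the threshold, else -1.

    With threshold 0.0, any measurable activity counts as positive.
    """
    return 1 if activity > threshold else -1


def one_hot(sequence):
    """Flatten a sequence into a binary vector of length 20 * len(sequence)."""
    vector = []
    for residue in sequence:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


rng = random.Random(0)                    # fixed seed for reproducibility
wild_type = "MKTAYIAKQR"                  # hypothetical toy sequence
variants = [mutate(wild_type, 2, rng) for _ in range(5)]
activities = [0.0, 0.8, 0.0, 1.3, 0.2]    # made-up screening values

X = [one_hot(v) for v in variants]        # feature vectors
y = [label(a) for a in activities]        # binary class labels
```

The resulting `X` and `y` are in the shape that a typical binary classifier, an SVM included, would accept as training input.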
Keywords/Search Tags:Sequence, Protein, Using, Different, Variants, Enzyme, Created, Computational