Embedding Method Based On Protein Sequences And Optimal Conditions Analysis

Posted on: 2020-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: X J Li
GTID: 2370330572484268
Subject: Computer Science and Technology
Abstract/Summary:
Proteins are structurally complex compounds and an important component of cells. Enzymes are a class of catalytic proteins that catalyze reactions only under specific environmental conditions, such as an acidic or high-temperature environment. The environmental conditions that maximize an enzyme's catalytic activity are referred to as its optimal conditions. Wild-type enzymes in nature often do not catalyze well under the conditions set by researchers. Therefore, determining the optimal conditions of enzymes and using protein engineering to obtain mutants that catalyze well in the expected environment is a hotspot of biological research.

Biological researchers typically use gradient tests to determine the optimal conditions of wild-type enzymes. They then use the tertiary structure of the protein to analyze the correlations between structure and optimal conditions, and obtain multiple mutants of the wild-type enzyme by directed mutation. Finally, they screen these mutants for one that catalyzes well under the expected environmental conditions. However, the gradient test process is complex, one experiment can be performed on only a single enzyme, and determining optimal conditions this way is inefficient. Although directed mutation can yield a mutant that catalyzes well under the expected conditions, the outcome of mutating a wild-type enzyme is uncontrollable, so biologists need to perform multiple mutations on one enzyme to screen for the expected mutants. It is therefore difficult and inefficient to obtain a mutant with the expected optimal conditions using conventional biological methods.

In view of these problems, this thesis explores the correlations between the optimal conditions and the sequence, based on the amino acid sequence of the enzyme. We propose an embedding method that represents the amino acids and the construct information as vectors in a latent space. These vectors contain information about the correlations between amino acids and sites, and between these and the optimal conditions. Using these vectors, we design a compatibility score to assess the compatibility of amino acids with sites. The compatibility score is applied to four tasks:

(1) Analysis of conserved and non-conserved fragments of the enzyme. If the compatibility scores of certain sequence fragments or sites with all kinds of amino acids are higher than those of other fragments or sites, this indicates that the fragments or sites can affect the optimal conditions, and they are non-conserved fragments. Otherwise, they are conserved fragments.
(2) Prediction of the optimal conditions for a given enzyme. We use the compatibility score between each sequence site and its amino acid as the feature value of that site, convert the amino acid sequence into a feature vector, and use a regression model to predict the optimal conditions of the enzyme.
(3) Mutation suggestion for a wild-type enzyme, given an expected condition. We find the site with the lowest compatibility score in the non-conserved fragment and improve the compatibility score of that site by replacing its amino acid, so that the optimal condition of the mutant is closer to the expected condition.
(4) Design of an amino acid sequence that has the expected optimal conditions. We refer to biological knowledge and norms to convert the compatibility score into the probability of an amino acid appearing at a given site. We then select an amino acid for each site according to the probability distribution of amino acids at that site, generating a new amino acid sequence that has the expected optimal conditions.

For practical use, we crawl the amino acid sequences of the glycoside hydrolase GH11 family from the CAZy website and collect the optimal pH of 125 enzymes as determined in related papers. Since the dataset of a single family is small, we adopt a probability approximation method to make the embedding method work on small samples. Compared with traditional biological methods, the embedding method of this thesis is faster and performs better. Compared with other computational methods, it requires less input and performs better.

To make the embedding method easier for biological researchers to use, this thesis develops a visualization tool for protein family embedding learning. The tool provides easy model debugging and a model evaluation interface, allowing biologists to modify and use the embedding model without computational expertise.
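
The abstract does not give the form of the compatibility score or of the score-to-probability conversion, so the following is only a minimal sketch under stated assumptions, not the thesis's implementation. It assumes the score is a dot product between an amino-acid vector and a site vector, that the site vectors already reflect the expected condition, and that a softmax converts scores into probabilities. All function names are hypothetical, and random vectors stand in for learned embeddings.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def compatibility(aa_vec, site_vec):
    """Assumed score: dot product of an amino-acid vector and a site vector."""
    return sum(a * s for a, s in zip(aa_vec, site_vec))

def site_probabilities(aa_vecs, site_vec, temperature=1.0):
    """Assumed conversion: softmax over the scores of all residues at one site."""
    scores = [compatibility(aa_vecs[aa], site_vec) / temperature for aa in AMINO_ACIDS]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {aa: w / total for aa, w in zip(AMINO_ACIDS, weights)}

def suggest_mutation(sequence, aa_vecs, site_vecs, candidate_sites):
    """Pick the lowest-scoring candidate site and its highest-scoring replacement."""
    site = min(candidate_sites,
               key=lambda i: compatibility(aa_vecs[sequence[i]], site_vecs[i]))
    best = max(AMINO_ACIDS, key=lambda aa: compatibility(aa_vecs[aa], site_vecs[site]))
    return site, sequence[site], best

def sample_sequence(aa_vecs, site_vecs, rng):
    """Draw one residue per site from that site's probability distribution."""
    residues = []
    for site_vec in site_vecs:
        probs = site_probabilities(aa_vecs, site_vec)
        residues.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return "".join(residues)

# Toy data only: random vectors replace the learned embeddings.
rng = random.Random(0)
DIM = 4
aa_vecs = {aa: [rng.gauss(0, 1) for _ in range(DIM)] for aa in AMINO_ACIDS}
site_vecs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
wild_type = "ACDEFG"  # hypothetical six-residue fragment

site, old, new = suggest_mutation(wild_type, aa_vecs, site_vecs, candidate_sites=range(6))
designed = sample_sequence(aa_vecs, site_vecs, rng)
```

Here `suggest_mutation` corresponds to task (3) and `sample_sequence` to task (4). In this sketch the candidate sites are passed in directly, whereas the thesis restricts them to non-conserved fragments.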
Keywords/Search Tags: Amino Acid Structure, Optimal Conditions, Embedding, Directed Mutation, Non-conserved Fragment, Visualization Tool