
Distributed Representation Of Amino Acids And Applications To Protein Sequence Analysis

Posted on: 2020-09-22
Degree: Master
Type: Thesis
Country: China
Candidate: L He
GTID: 2370330575977339
Subject: Computer technology
Abstract/Summary:
Proteins are the material basis of all life: without proteins there is no life, let alone human reproduction. Amino acids are the basic building blocks of proteins. A protein is formed by different kinds of amino acids arranged in a particular order, called the protein sequence. Analyzing the protein sequence helps reveal the spatial structure of the protein, which is crucial for functional analysis and drug design, because the biological function of a protein depends largely on its spatial structure. The biochemical properties and functions of a protein are in turn closely related to its sequence, so protein sequence analysis is the premise and basis of protein structure analysis and even functional analysis.

The first step in protein sequence analysis is to encode the sequence. The most commonly used encodings are one-hot encoding, PSSM encoding, and amino acid vector encoding. One-hot encoding converts amino acid residues into orthogonal vectors and assumes the vectors are independent of one another; it is simple to compute, but it does not capture context or the order of residues well. PSSM profile encoding, built from multiple sequence alignments with a scoring method, overcomes this shortcoming, but the iterative nature of the algorithm makes it very sensitive to bias in the sequence database. In particular, it can easily incorporate a repeat sequence into the intermediate profile by mistake. Encoding amino acid sequences with vectors generated by Word2vec avoids repeat-sequence errors, but it does not express the correlation between homologous sequences.

To address the shortcomings of these encodings, this thesis proposes an embedding vector representation based on an algorithm that generates k-mer amino acid sequences from multiple sequence alignment profiles. The alignment profile of similar protein sequences serves as the input for training the embedding: Word2vec is trained on it to obtain a vector for each amino acid, which is the distributed vector of that amino acid. A bidirectional LSTM recurrent neural network is then applied to protein secondary structure prediction and protein water solubility prediction. During prediction, the distributed amino acid vectors are used as the input, and the bidirectional LSTM predicts the eight-class secondary structure and the water solubility of the protein. The thesis evaluates distributed vectors generated from single amino acids and from three consecutive amino acids.

With the proposed distributed representation, protein secondary structure prediction reaches an accuracy of 68.8% on the CB513 data set, and water solubility prediction reaches an accuracy of 73.3% on the SOLP data set. The experimental results show that, within the bidirectional LSTM framework, using only the proposed multiple-sequence-alignment-based distributed amino acid vectors as model input gives results comparable to or better than current mainstream methods for protein secondary structure prediction and protein water solubility prediction.

Research indicates that similar protein sequences are highly homologous and have substantially the same function, and that amino acids at the same position in protein sequences with the same function tend to be identical or mutually substitutable. Furthermore, the structure and function of a protein whose structure and function are unknown can be inferred by comparing its sequence with that of a similar protein whose structure and function are known. The embedding vector proposed in this thesis is therefore well motivated. Because similar protein sequences are strongly correlated, and because recurrent neural networks can learn long-term dependencies and adapt their parameters to the data, the proposed approach achieves better prediction results.
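
As a rough illustration of two encoding ideas mentioned in the abstract, the sketch below shows one-hot encoding of residues and the splitting of a sequence into k-mer "words" of the kind a Word2vec model could be trained on. It is not the thesis's implementation. The function names `one_hot` and `kmers`, the 20-letter alphabet ordering, and the choice of overlapping k-mers are assumptions made for illustration; the abstract does not say how the k-mers are cut, and the alignment-profile step and the Word2vec training itself are not shown.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each residue as a 20-dimensional vector with a single 1.

    Any two different residues map to orthogonal vectors, so the encoding
    carries no notion of similarity or context between residues.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1
        vectors.append(row)
    return vectors


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenization).

    With k=1 every residue is its own token; with k=3 each token is
    three consecutive residues.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    seq = "MKTAY"  # made-up example sequence
    print(kmers(seq, k=3))
    print(one_hot(seq)[0])
```

Each list of k-mers plays the role of a "sentence" of tokens. An embedding model trained on many such sentences assigns each token a dense vector, in contrast to the sparse, mutually orthogonal one-hot vectors.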
Keywords/Search Tags: protein secondary structure prediction, protein water solubility prediction, amino acid distributed representation, recurrent neural network