
Protein Sequence Analysis Based On Transformer Model

Posted on: 2021-05-14  Degree: Master  Type: Thesis
Country: China  Candidate: S Q Wang  Full Text: PDF
GTID: 2370330629452699  Subject: Computer application technology
Abstract/Summary:
Proteins are essential components of all human cells and tissues and participate in every important life activity. Amino acids are the basic building blocks of proteins, and their arrangement and combination form the basic backbone of a protein, known as the protein sequence. Analyzing protein sequences greatly aids the further analysis of higher-level structural information and is therefore a prerequisite for studying protein structure and function. It also plays an important role in applications such as drug design.

The first step in protein sequence analysis is to encode the sequence and extract features. Three encoding methods are most common: the first converts each amino acid residue into an orthogonal vector (one-hot encoding); the second constructs a PSSM profile through multiple sequence alignment and scoring; and the third trains amino acid vectors with word2vec. This thesis uses these features to analyze protein sequences.

Existing protein sequence analysis models based on bidirectional recurrent neural networks are relatively effective, mainly because protein sequences resemble text. However, RNN, LSTM, and similar models still cannot effectively capture dependencies between distant positions, so such problems are not solved well. Moreover, within a local region of a protein sequence, adjacent amino acid residues are connected by chemical bonds, while local amino acid groups and neighboring groups also interact through various molecular forces, which traditional neural networks cannot effectively recognize. Because of the strong correlations among protein sequence features, this thesis uses a Transformer model based on the self-attention mechanism to predict and analyze protein sequences. Self-attention is an algorithm that computes the correlations between features, making it particularly suitable for protein sequence analysis.

In summary, this thesis constructs a Transformer-based framework for two typical protein sequence analysis problems: protein secondary structure prediction and solubility analysis. To verify the effectiveness of the proposed model, numerical simulation experiments were performed on the framework. Compared with the existing literature, both experiments achieved the best results, indicating that the Transformer model based on the self-attention mechanism solves protein sequence analysis tasks very well. This thesis also analyzes the influence of different protein features and finds that models using PSSM profiles and distributed amino acid representations as input perform better. Compared with PSSM and other features commonly used in existing methods, the distributed amino acid representation is very simple, requires no extra computation, and still achieves considerable results. The Transformer-based protein sequence analysis methods proposed in this thesis improve algorithmic efficiency and can surpass or approach the best available methods even with very simple features.
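To make the core idea concrete, the following is a minimal sketch (not the thesis code) of single-head scaled dot-product self-attention applied to a one-hot-encoded protein sequence. The toy sequence, the random projection weights, and the dimension sizes are illustrative assumptions; a real Transformer would use learned weights, multiple heads, and positional information.

```python
# Minimal sketch of one-hot encoding plus scaled dot-product self-attention
# over a protein sequence. All weights here are random stand-ins for learned
# parameters; this only illustrates the operation described in the abstract.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot_encode(sequence):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    x = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        x[pos, index[aa]] = 1.0
    return x

def self_attention(x, d_k=16, seed=0):
    """Single-head scaled dot-product self-attention over residue features.

    Every residue attends to every other residue in one step, so correlations
    between distant positions are captured directly, unlike in an RNN/LSTM.
    """
    rng = np.random.default_rng(seed)
    d_in = x.shape[1]
    # Randomly initialised projections stand in for learned Q/K/V weights.
    w_q = rng.normal(scale=d_in ** -0.5, size=(d_in, d_k))
    w_k = rng.normal(scale=d_in ** -0.5, size=(d_in, d_k))
    w_v = rng.normal(scale=d_in ** -0.5, size=(d_in, d_k))

    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d_k)                   # pairwise residue correlations
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ v                                # context-mixed residue features

features = one_hot_encode("MKTAYIAKQR")               # hypothetical 10-residue fragment
print(self_attention(features).shape)                 # (10, 16)
```

The same attention operation applies unchanged when the input rows are PSSM profiles or word2vec-style distributed amino acid vectors instead of one-hot vectors; only the input dimension changes.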
Keywords/Search Tags: Transformer, Self-attention, Sequence analysis, Protein secondary structure prediction, Protein water solubility prediction