
Experimental Analysis Of Protein Structure Classification Based On Language Processing Model

Posted on: 2020-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Zhang
GTID: 2370330596982427
Subject: Software engineering
Abstract/Summary:
Computational study of protein spatial structure is an effective supplement to experimental methods and can be applied to the prediction, design and comparison of protein structures. Research in this field has become an important part of protein engineering. The spatial structure of a protein is determined by the order of its amino acids. Different amino acids form complex spatial structures under the influence of peptide bonds, hydrogen bonds, van der Waals forces and electrostatic interactions, but these structures follow certain rules. The correspondence between amino acid sequence and protein spatial structure is therefore a central topic in structural biology.

This paper applies ideas from language processing to the classification of protein structure, treating the amino acid sequences in a protein as a natural language. Previous research mainly used generative models to predict protein spatial structure from the amino acid sequence. This paper instead uses a discriminative model, which has not been proposed before.

The UniProt protein library was selected as the source of protein sequences, and the data set was labeled according to the structural information for each protein in the PDB data set. The final data set contained 2,985,181 protein sequences, each consisting of 50 amino acids. Two word-vectorization methods were compared: skip-gram and the text-classification method in FastText. Sequences were segmented into words of length 6 and 9, and word-vector dimensions of 5 and 50 were used. Classification was performed with a nonlinear LSTM model and a linear FastText model, and both a test set and an extended test set were used for evaluation. Based on these five sets of variables, 10 sets of comparative experiments were carried out, yielding 20 experimental results.

The best-performing configuration segments the training set into words, vectorizes them with the text-classification method in FastText at a dimension of 5, and then classifies with the LSTM model. This method achieves the highest prediction accuracy on both test sets, reaching 68.61% and 80.89% respectively.
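The word segmentation step mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say whether words overlap, so the `stride` parameter below is an assumption: `stride=1` gives overlapping words, and `stride=k` gives non-overlapping ones. The function name `kmer_words` and the example fragment are hypothetical and are not taken from the thesis.

```python
from typing import List


def kmer_words(sequence: str, k: int, stride: int = 1) -> List[str]:
    """Split an amino acid sequence into k-residue 'words'.

    Assumption: the thesis does not state its segmentation scheme, so
    stride is left configurable (1 = overlapping, k = non-overlapping).
    Trailing residues that do not fill a whole word are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# Hypothetical 50-residue fragment, made up for illustration only.
fragment = (
    "MKTAYIAKQR"
    "QISFVKSHFS"
    "RQLEERLGLI"
    "EVQAPILSRV"
    "GDGTQDNLSG"
)

# The two word lengths named in the abstract.
for k in (6, 9):
    overlapping = kmer_words(fragment, k)
    disjoint = kmer_words(fragment, k, stride=k)
    print(k, len(overlapping), len(disjoint))
```

The resulting lists of words play the role of tokenized sentences, which could then be passed to a word-embedding trainer such as skip-gram or FastText before classification.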
Keywords/Search Tags: Protein structure, Language processing, Machine learning