Experimental Analysis Of Protein Structure Classification Based On Language Processing Model

Posted on:2020-05-18

Degree:Master

Type:Thesis

Country:China

Candidate:T Y Zhang

GTID:2370330596982427

Subject:Software engineering

Abstract/Summary:

Computational study of protein spatial structure is an effective supplement to experimental methods and can be applied to the prediction, design, and comparison of protein spatial structures. Research in this field has become an important part of protein engineering. The spatial structure of a protein is based on the order of its different kinds of amino acids. These amino acids form complex spatial structures under the influence of peptide bonds, hydrogen bonds, van der Waals forces, and electrostatic interactions, yet the resulting structures follow certain rules. Studying the correspondence between amino acid sequences and protein spatial structure is therefore a vital topic in structural biology.

This thesis applies ideas from language processing to the classification of protein structure, treating the amino acid sequences of proteins as a natural language. Previous research mainly used generative models to predict the spatial structure of a protein from its amino acid sequence. This thesis instead uses a discriminative model, an approach that has not been proposed before.

The UniProt protein library was selected as the experimental data set of protein sequences, and the data were labeled according to the structural information for each protein in the PDB data set. The final data set contained 2,985,181 protein sequences, each containing 50 amino acids.

For word vectorization, two methods were selected: skip-gram and the text classification method in FastText. Word segmentation lengths of 6 and 9 were used, and the word vector dimension was set to 5 and 50. Classification was performed with a nonlinear LSTM model and a linear FastText model, and each model was evaluated on both the test set and the extended test set. Based on these five sets of variables, 10 sets of comparative experiments were carried out, yielding 20 experimental results.

The best results were obtained by vectorizing the training set with the text classification method in FastText according to the number of words, with a vector dimension of 5, and then classifying with the LSTM model. This method achieved the highest prediction accuracy on the two test sets, reaching 68.61% and 80.89% respectively.
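
The word segmentation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say whether the words overlap, so the function name `segment_sequence`, its `stride` parameter, and the example sequence below are assumptions made for illustration and are not taken from the thesis.

```python
from typing import List, Optional


def segment_sequence(sequence: str, word_length: int,
                     stride: Optional[int] = None) -> List[str]:
    """Split an amino acid sequence into fixed-length "words".

    Assumption: stride=None gives non-overlapping words (stride equals
    word_length); stride=1 gives fully overlapping words. The thesis may
    have used either scheme, or a different one.
    """
    if word_length <= 0:
        raise ValueError("word_length must be positive")
    step = word_length if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    # Last index at which a complete word still fits in the sequence.
    last_start = len(sequence) - word_length
    # Trailing residues that do not fill a whole word are dropped.
    return [sequence[i:i + word_length]
            for i in range(0, last_start + 1, step)]


# Hypothetical 50-residue sequence, built only for demonstration.
example = "ACDEFGHIKLMNPQRSTVWY" * 2 + "ACDEFGHIKL"
words_6 = segment_sequence(example, 6)            # non-overlapping, length 6
words_9 = segment_sequence(example, 9)            # non-overlapping, length 9
sliding_6 = segment_sequence(example, 6, stride=1)  # overlapping, length 6
```

Under these assumptions, a 50-residue sequence yields 8 non-overlapping words of length 6, or 45 overlapping ones. The resulting word lists are the kind of token sequences that a skip-gram or FastText model would take as input.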

Keywords/Search Tags:

Protein structure, Language processing, Machine learning

Related items

1	Research On Protein Remote Homology Detection Based On Machine Learning Methods
2	Research Of Protein-Protein Interaction Extraction Based On Rich Feature And Multiple Kernels Learning
3	Prediction Of Protein Structure And Function With Machine Learning Methods
4	Application Of Machine Learning Algorithm In Protein Structure Prediction
5	Protein-protein Interaction Sites Prediction Based On Natural Language Processing
6	Study On Key Problems Of Protein-Ligand Docking Based On Machine Learning
7	Consider Quantum State Tomography As Language Modeling Task
8	Research On Predicting Protein-protein Interactions Based On Machine Learning
9	Prediction Of Protein Structure Classes And Topology Analysis Of Protein Interaction Network Based On Support Vector Machine
10	The Machine Learning Model Of Protein Structural Prediction Based On Protein Sequence