Research On Protein-protein Binding Sites Prediction Method Based On Sequence Information

Posted on: 2019-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: W He
GTID: 2370330566998321
Subject: Computer Science and Technology
Abstract/Summary:
Organisms rely on interactions between proteins, and between proteins and other substances, to carry out their life activities. Studying protein interactions is therefore important for understanding the mechanisms of these activities, and it has broad theoretical and practical prospects. This thesis studies protein binding site prediction based on machine learning. The basic approach is to extract and combine various types of protein-related information, represent the amino acid sequence with suitable feature vectors, and then apply a classification algorithm to these features to determine the category of each amino acid. The thesis studies sequence feature extraction based on the long short-term memory (LSTM) network and improves the original algorithm by analyzing the biological background of the problem. In addition, from the perspective of multi-feature learning, it uses different types of information to construct a two-layer ensemble learning model that improves prediction performance.

The thesis first introduces a sequence information extraction method based on an improved LSTM model. The improvements are as follows. First, to reflect the clustered distribution of protein binding sites, the output layer of the network is connected to the input layer of the next time step, so that the category information of the residues adjacent to the target amino acid is introduced into the network. Second, because imposing a single reading order on a protein sequence is arbitrary, the training process is modified to train two independent prediction models: the amino acid sequence data are scanned in the forward and backward directions, and each scan is used to train its own network. The weighted results of the two classifiers then serve as the basis for the final classification. The effectiveness of the algorithm is verified by comparison experiments and an analysis of the results.

Because the amino acid residues in protein chains have many kinds of physical and chemical properties, as well as primary and spatial structure, the thesis also introduces an ensemble learning model that uses multiple types of features to represent the amino acid sequence more effectively. The model has two layers. The first layer consists of three base classifiers, which use the position-specific scoring matrix (PSSM), bi-gram, and pseudo-amino acid features, respectively. The data set is partitioned so that each base classifier is trained, and all samples are predicted, using a strategy similar to cross-validation. The predictions of the base classifiers are then combined with the sequence features extracted by the improved LSTM model of the previous chapter, and together they form the feature vector of the second layer, which performs the final classification. Experiments are carried out on three groups of data sets divided according to sequence alignment results. Finally, the thesis analyzes the relationship between the prediction performance of the base classifiers and that of the ensemble classifier, examines the relevant model parameters, and compares the results with previous methods to verify the effectiveness of the proposed method.
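
The two-layer scheme described above, in which the base classifiers predict every sample through a strategy similar to cross-validation, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the nearest-centroid scorer is a toy stand-in for the real base classifiers, the three "views" stand in for the PSSM, bi-gram, and pseudo-amino acid features, `seq_features` stands in for the LSTM-derived sequence features, and all function names are hypothetical.

```python
import random

def make_folds(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroid_scorer(X, y):
    """Toy base classifier (an assumed stand-in, not the thesis's model).

    Returns a function whose score is higher when a sample lies closer
    to the positive-class centroid than to the negative-class centroid.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    pos = centroid([x for x, t in zip(X, y) if t == 1])
    neg = centroid([x for x, t in zip(X, y) if t == 0])
    return lambda x: sq_dist(x, neg) - sq_dist(x, pos)

def out_of_fold_scores(X, y, folds):
    """Score every sample with a model that did not see it in training."""
    scores = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        scorer = fit_centroid_scorer([X[i] for i in train],
                                     [y[i] for i in train])
        for i in held_out:
            scores[i] = scorer(X[i])
    return scores

def second_layer_features(views, y, seq_features, k=5):
    """Join each view's out-of-fold score with the sequence features."""
    folds = make_folds(len(y), k)
    per_view = [out_of_fold_scores(X, y, folds) for X in views]
    return [list(s) + list(f)
            for s, f in zip(zip(*per_view), seq_features)]

if __name__ == "__main__":
    rng = random.Random(1)
    labels = [i % 2 for i in range(20)]            # toy residue labels
    toy_views = [[[t + rng.random(), rng.random()] for t in labels]
                 for _ in range(3)]                # three feature views
    toy_seq = [[rng.random() for _ in range(4)] for _ in labels]
    rows = second_layer_features(toy_views, labels, toy_seq)
    print(len(rows), len(rows[0]))
```

Using out-of-fold scores means the second-layer classifier is trained on base predictions made for samples the base models had not seen, which is the usual motivation for a cross-validation-like split in stacked models. Any second-layer classifier could then be fitted on the returned rows.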
Keywords/Search Tags: prediction of protein binding sites, long short-term memory networks, feature fusion, ensemble learning strategies