
Research On Protein Hot Spot Residues Based On Ensemble Learning And Deep Embedding Learning

Posted on: 2022-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: S J Yao
Full Text: PDF
GTID: 2480306542967469
Subject: Biology
Abstract/Summary:
Proteins are among the most basic macromolecules that make up living organisms, and they participate in virtually every biological process. Protein-protein interaction is one of the most important ways in which proteins carry out their functions. In these interactions, a small fraction of residues contributes most of the binding free energy; these residues are called hot spots. Identifying hot spots is important for understanding a variety of physiological and biochemical reactions inside and outside the cell, as well as for designing drug targets. Identifying hot spots experimentally is costly in labor and materials, so computational methods are used instead. Researchers have developed various computational methods for hot spot prediction, but most of them require protein structure information in addition to other information. Compared with structure information, protein sequence information is easier to obtain and analyze. Therefore, this thesis uses protein sequence information to predict hot spots, based on the traditional machine learning methods K-nearest neighbor and support vector machine and on a convolutional neural network. The specific research work is as follows:

1. We propose a full-sequence prediction model of protein hot spot residues based on machine learning. Most previous methods presuppose that the interface residues are known. Among the interface residues, hot spots and non-hot spots are defined by the change in binding free energy when an amino acid is mutated to alanine. This yields a relatively balanced data set. In fact, hot spot residues account for only 1%-2% of all residues in a protein sequence. Moreover, accurately identifying interface residues is itself a great challenge. Therefore, to better reflect the real situation, we predict hot spot residues starting from the complete protein sequence. We first collected four data sets of protein hot spot residues and constructed 67 sample subsets with balanced positive and negative samples from the ASEdb training set. Next, we extracted 1415-dimensional features based on full-sequence information and used the ReliefF algorithm to reduce the dimension. Finally, two classification algorithms, K-nearest neighbor and support vector machine, were applied to each of the 67 sample subsets, and the final predictions were obtained through an improved majority voting algorithm. On the BID test set, the F1 score is 0.593, which is better than the previous best experimental result.

2. We predict hot spot residues in protein sequences based on deep embedding learning. Using protein sequence embedding, 4117-dimensional sequence-based features are obtained. The protein sequence is then cut into segments, and each segment is used to represent the residue at its center position. Finally, a one-dimensional convolutional neural network is used for prediction. Compared with traditional machine learning methods, the prediction results obtained by deep learning are significantly improved.
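The balanced-subset ensemble described in part 1 can be illustrated with a minimal sketch. It is not the thesis implementation. The function names (`balanced_subsets`, `knn_predict`, `ensemble_predict`) are hypothetical, a plain k-nearest-neighbor vote stands in for the K-nearest neighbor and support vector machine pair, and a simple majority vote replaces the "improved" voting scheme, whose details the abstract does not give. The toy two-dimensional points stand in for the ReliefF-selected sequence features.

```python
import random
from collections import Counter


def balanced_subsets(positives, negatives, seed=0):
    """Pair all positives with successive chunks of shuffled negatives.

    Each chunk has as many negatives as there are positives, so every
    subset is balanced (the last one may be smaller if the negatives do
    not divide evenly).
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    subsets = []
    for start in range(0, len(shuffled), size):
        chunk = shuffled[start:start + size]
        subsets.append([(x, 1) for x in positives] + [(x, 0) for x in chunk])
    return subsets


def knn_predict(train, query, k=3):
    """Label a query by majority among its k nearest training samples."""
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))

    nearest = sorted(train, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def ensemble_predict(subsets, query, k=3):
    """Train-free ensemble: one k-NN vote per balanced subset, then a
    simple majority over the subset-level votes (assumed, not the
    thesis's improved voting rule)."""
    votes = [knn_predict(subset, query, k) for subset in subsets]
    return 1 if sum(votes) * 2 > len(votes) else 0


# Toy data: 3 hot spots (label 1) against 9 non-hot spots (label 0).
positives = [(0.9, 1.0), (1.0, 0.8), (1.1, 1.2)]
negatives = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (-0.1, 0.2), (0.3, 0.1),
             (0.0, -0.2), (0.2, 0.2), (-0.2, 0.0), (0.1, 0.1)]
subsets = balanced_subsets(positives, negatives)
print(len(subsets), ensemble_predict(subsets, (1.0, 1.0)))
```

Reusing every positive sample in each subset lets each base classifier see a balanced class ratio while the ensemble as a whole still covers all negative samples, which is the motivation the abstract gives for building 67 subsets from the heavily imbalanced full-sequence data.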
Keywords/Search Tags: Protein hot spot residues, Interaction between protein and protein, Sequence information, Machine learning, Deep learning