
Computational Biology-Based Prediction Of Key Sites In Proteins

Posted on: 2024-03-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Y Zhang
GTID: 1520307109481124
Subject: Intelligent Environment Analysis and Planning
Abstract/Summary:
Environmental factors are important extrinsic influences on the growth and development of organisms, and the evolution of organisms is itself shaped by the environment. According to Darwin's theory of evolution, organisms have been subject to continuous survival of the fittest while adapting to their environment over a long evolutionary history, which has produced today's biological communities. From a biological point of view, adaptation to the environment is the continuous evolution of the biological macromolecules within an organism. Molecular biology research has shown that the genetic information of all cellular organisms is stored in DNA, while the macromolecules that actually carry out biological functions are proteins. It is increasingly recognized that proteins play an important role in the adaptation of organisms to the environment, so the study of proteins is important for revealing the intrinsic mechanisms of that adaptation.

Biological experiments have shown that a variety of post-translational modifications are essential for proteins to acquire their biological functions, and that the network formed by interactions between proteins is the basic system regulating life activities. The pathogenic antigens of many serious environment-related diseases are also proteins. These biological processes depend on key sites in proteins, and identifying those sites accurately and rapidly is important both for elucidating the mechanisms of biological adaptation to the environment and for the diagnosis and treatment of related diseases. Although experimental methods can locate these key sites effectively, they are time-consuming and laborious, they cannot quickly identify large numbers of key sites, and not all experiments succeed. These factors seriously restrict the development of related research, so new research tools are needed. In recent years, with the development of computational biology, computational methods have been proposed for predicting protein post-translational modification sites, protein interaction sites, and epitopes. These methods can quickly predict key sites in proteins, and their predictions can provide clues for further in vivo or in vitro experiments, thereby promoting related research. This dissertation studies the computational identification of key sites in proteins as follows.

(1) We propose a method for predicting protein-protein interaction sites based on an improved random forest algorithm. First, we extract features of the protein to be predicted, including amino acid physicochemical properties, residue disorder score, sequence conservation score, protein secondary structure, and 3D structure features. We then apply a feature selection algorithm based on the minimal Redundancy Maximum Relevancy (mRMR) criterion and incremental feature selection (IFS) to optimize the original feature vector and extract the optimal feature vector. In the classification stage, we first propose an improved synthetic minority oversampling technique (SMOTE) to address the imbalance between positive and negative samples in the training set, and then propose a modified random forest model and train it to identify protein interaction sites. Feature analysis shows that two features based on the three-dimensional structure of the protein, accessible surface area and hydrophobicity, contribute most to recognition accuracy. Finally, we design experiments to test the performance of the algorithm. The prediction sensitivity reached 85.61%, the specificity reached 82.33%, and the accuracy reached 84.13%, which is better than existing algorithms and demonstrates the effectiveness of our algorithm in predicting protein interaction sites.

(2) A novel semi-supervised deep learning framework is proposed for the computational identification of sites of a novel type of protein post-translational modification, succinylation. The framework consists of 3 modules. The first module is based on an improved semi-supervised learning algorithm called positive-unlabeled learning, which automatically builds a high-reliability negative sample set from the unlabeled sample set and improves the quality of the training set. The second module is a positive sample amplification algorithm based on a generative adversarial network, which generates new positive samples in order to balance the positive and negative samples. Building on these two modules, the third module is a deep network model for protein succinylation site prediction. The framework uses three different feature extraction methods: pseudo-amino-acid composition in the first module; a 1D effective vector representing an amino acid sequence in the second module; and, in the third module, a transformation of an amino acid sequence into five 21-dimensional square matrices, which resembles the representation of digital images in a computer and helps the training of the deep network. We performed 10-fold cross-validation on the training set and tested the prediction performance on an independent test set. The results show that the reliable negative sample set construction strategy and the positive sample set amplification strategy proposed in this study effectively improve the prediction performance of the algorithm, and the method was compared with existing succinylation site prediction algorithms.

(3) A deep ensemble architecture based on protein sequence information is proposed for the computational prediction of B cell epitopes in antigenic proteins. For ensemble learning, the architecture contains eight independent convolutional neural networks. First, one-hot vectors, the physicochemical properties of amino acids, and protein secondary structure are used to transform a protein sequence into 8 feature matrices, which serve as the inputs of the 8 convolutional neural networks. The 8 independent convolutional neural networks are then trained to identify B cell epitopes, and the final epitope prediction is obtained by integrating the predictions of the 8 networks according to the ensemble learning strategy. We first tested the performance of the proposed method on a universal test set. The sensitivity of B cell epitope prediction reached 0.713, and the AUC and Matthews correlation coefficient reached 0.778 and 0.228, respectively. We also tested the algorithm on 13 independent test sets, where it again outperformed existing algorithms. The comparisons show that our method can predict conformational B-cell epitopes at an acceptable level of AUC. In addition, we used the deep ensemble framework to predict and analyze the spike glycoprotein of the 2019 novel coronavirus (SARS-CoV-2) and located potential B cell epitopes. These predictions can help promote effective vaccine design against this highly infectious virus.
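The results above are reported as sensitivity, specificity, accuracy, and Matthews correlation coefficient. As a minimal sketch of how these quantities are conventionally computed for binary site prediction, the following uses the standard textbook definitions; the function names and the toy labels are illustrative assumptions and are not taken from the dissertation. AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = key site, 0 = non-site)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def site_metrics(y_true, y_pred):
    """Return sensitivity, specificity, accuracy, and MCC (hypothetical helper)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    # Guard each ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Toy example (made-up labels, not data from the dissertation):
# 3 true sites and 5 non-sites, with one missed site and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = site_metrics(y_true, y_pred)
```

On imbalanced data, where non-sites far outnumber key sites, accuracy alone can look high even when few true sites are found, which is one reason sensitivity and the Matthews correlation coefficient are commonly reported alongside it.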
Keywords: Computational Biology, Protein-Protein Interaction Sites, Protein Post-translational Modification Sites, Protein Succinylation Sites, B-cell Epitope, Deep Learning, Semi-supervised Learning