Prediction Models For The Effect Of Point Mutation On Protein Folding Rate

Posted on: 2024-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
Full Text: PDF
GTID: 2530306941463734
Subject: Computer technology
Abstract/Summary:
Protein folding is the process by which a polypeptide chain goes from a random-coil state to a specific three-dimensional structure. The protein folding rate is an important parameter of this process and measures how fast folding proceeds. Accurately predicting the effect of a point mutation on the protein folding rate is of great significance for studying folding mechanisms, designing site-directed mutations, and synthesizing new proteins. In addition, a point mutation may alter the folding rate of a protein, leading to changes in protein structure and function and even causing disease. With the accumulation of related variant data, computational methods have been used to study the effect of point mutations on protein folding rate. Considering that proteins have a certain tolerance to mutation, we define the prediction of the effect of a point mutation on protein folding rate as a three-class classification problem: decreasing the folding rate, having no effect on the folding rate, or increasing the folding rate. Several models are constructed based on ensemble learning and transfer learning. The main contents are as follows.

(1) Most existing prediction models are built with a single algorithm, which makes it difficult to make full use of the mutation data and related features. To address this, an ensemble prediction model, PON-Fold, based on the Stacking strategy is proposed. First, 341-dimensional biological features are collected and calculated from five aspects, including amino acid properties and mutation information, and 28-dimensional features are selected by the LightGBM algorithm and forward search. Then, because the data are imbalanced, the interpolation down-sampling strategy is adopted to construct multiple balanced training subsets, and multiple base learners are trained with the RF, XGBoost, and LightGBM algorithms, respectively. Finally, a meta-learner is trained using the logistic regression algorithm, and an ensemble model is constructed based on the Stacking strategy. Compared with the existing prediction method Folding RaCe, PON-Fold has more balanced prediction performance on the three classes of samples, and its prediction accuracy on the test set is 4.4% higher.

(2) Manually calculated biological features alone are not comprehensive enough, and only a limited amount of variant data related to protein folding is currently available. We therefore further propose to use a deep learning model to extract features from the sequence context and the structural neighborhood residues, and to perform transfer learning based on protein stability, to construct the model PON-FoldST. First, the sequence context and structural neighborhood residues of the mutation site are obtained, and their features are encoded from multiple perspectives such as amino acid properties and protein evolution. Then, a one-dimensional convolutional network, a bidirectional long short-term memory network, and a multi-head attention mechanism are used to extract features from them, and the extracted features are fused with the 28-dimensional biological features selected in (1). For training, the model is first pre-trained on the larger body of mutation data related to protein stability, and the pre-trained model is then fine-tuned on mutation data related to protein folding rate to obtain the prediction model PON-FoldST. In addition, a fusion model is constructed based on PON-Fold and PON-FoldST. Compared with PON-Fold, the prediction accuracy of the fusion model on the test set and in five-fold cross-validation is increased by 1.1% and 2.2%, respectively. Finally, an online service platform is developed based on the fusion model to provide prediction services for researchers.

In summary, several models are constructed to predict the effect of point mutations on protein folding rate, addressing two limitations of previous studies: the reliance on simple biological features and on single-algorithm models. Further optimization is made in terms of feature comprehensiveness and model training. As a result, the models achieve good prediction performance.
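
The three-class formulation and the construction of balanced training subsets described in (1) can be illustrated with a minimal sketch. It is not the thesis implementation: the threshold `tolerance`, the helper names `label_mutation` and `balanced_subsets`, and the toy values in `deltas` are all hypothetical, and plain random down-sampling of the larger classes stands in for the "interpolation down-sampling" strategy, whose details the abstract does not give.

```python
import random
from collections import Counter, defaultdict

# Class labels for the three-class formulation described in the abstract.
DECREASE, NO_EFFECT, INCREASE = 0, 1, 2


def label_mutation(delta_ln_kf, tolerance=0.5):
    """Map a change in ln(folding rate) to one of three classes.

    `tolerance` is a hypothetical threshold, not a value from the thesis.
    """
    if delta_ln_kf < -tolerance:
        return DECREASE
    if delta_ln_kf > tolerance:
        return INCREASE
    return NO_EFFECT


def balanced_subsets(labels, n_subsets, seed=0):
    """Return `n_subsets` lists of sample indices with equal class counts.

    Assumption: every class is randomly down-sampled (without replacement
    inside one subset) to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    n_min = min(len(indices) for indices in by_class.values())

    subsets = []
    for _ in range(n_subsets):
        subset = []
        for y in sorted(by_class):
            subset.extend(rng.sample(by_class[y], n_min))
        subsets.append(sorted(subset))
    return subsets


# Toy data (made up for illustration): changes in ln(folding rate).
deltas = [-1.2, -0.8, -0.1, 0.0, 0.2, 0.3, -0.3, 0.1, 0.9, 1.5, -2.0, 0.4]
labels = [label_mutation(d) for d in deltas]

for subset in balanced_subsets(labels, n_subsets=3):
    # Each subset could train one base learner (e.g. RF, XGBoost, LightGBM).
    print(subset, Counter(labels[i] for i in subset))
```

In a stacking setup of the kind the abstract describes, each balanced subset would train one base learner, and the base learners' predicted class probabilities would then serve as input features for the logistic-regression meta-learner.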
Keywords/Search Tags: Point Mutation, Protein Folding Rate, Stacking Ensemble Learning, Transfer Learning, LightGBM