
Design And Implementation Of Thermophilic Protein Prediction Model Based On Machine Learning

Posted on: 2021-09-01, Degree: Master, Type: Thesis
Country: China, Candidate: P Gao
GTID: 2481306197495844, Subject: Master of Engineering
Abstract/Summary:
Proteins are the material basis of life. Studying the thermophilic properties of proteins helps explain the principles of protein thermal stability and guides the development of biocatalysts for extreme environments, drug development, and industrial applications. Identifying thermophilic proteins with traditional biological methods is time-consuming and costly, so a fast and effective identification method is needed. Most existing machine learning methods extract only a single protein feature, which does not represent thermophilic proteins comprehensively, and use only a single classifier for prediction, which limits model performance to some extent. This dissertation therefore takes the prediction of thermophilic proteins as its research goal and studies protein sequence encoding methods and machine-learning-based prediction of thermophilic proteins. The main research contents and results are as follows:

(1) Three prediction models of thermophilic proteins based on single features were studied. First, encoding based on grouped weight, g-gap dipeptide composition, and three-segment amino acid composition were used to extract features from the protein sequences. Second, by running multiple classification algorithms, the best parameters and the best classifier for each feature extraction method were found, and three prediction models based on single features were constructed. Comparison of the experimental results showed that amino acid frequency better reflects the intrinsic information of thermophilic proteins. Across the different feature extraction methods, the support vector machine (SVM) gave better prediction results than the other classification algorithms, which showed the superiority of SVM in thermophilic protein prediction.

(2) A thermophilic protein prediction model based on feature fusion was constructed. First, to characterize thermophilic proteins more comprehensively, the sequences were represented by fusing g-gap dipeptide, entropy density, and autocorrelation coefficient features. This representation captures the amino acid information, the physicochemical properties, and the correlation between residues in the thermophilic protein. Second, to reduce computational complexity and improve prediction accuracy, kernel principal component analysis (KPCA) was used to reduce the dimension of the fused sequence features. Finally, the best feature vectors were input to the SVM for prediction, and the jackknife method was used to verify the model with a variety of evaluation indicators. Experimental results showed that the prediction accuracy of the method on the selected data set was 92.81% and the AUC was 0.97. The model also performed well on other standard data sets, which supports the validity of the multi-feature thermophilic protein prediction model proposed in this thesis.

(3) A thermophilic protein prediction model based on a bi-layer cascade SVM was constructed. The first-layer SVM classifiers were trained and optimized with different protein sequence features: encoding based on grouped weight, g-gap dipeptide composition, three-segment amino acid composition, and feature fusion. Because SVM had shown superiority in predicting thermophilic proteins, the results from the first layer were cascaded to a second-layer SVM classifier, which was trained to generate the final model. The results showed that the overall prediction performance improved further: the accuracy reached 94.51% under jackknife verification, and the various performance evaluation indicators were higher than those of other methods. The model also performed well on other standard data sets, which showed that it is robust and can effectively improve the prediction accuracy of thermophilic proteins.

(4) An online prediction system for thermophilic proteins was implemented. To make the results easier for other researchers to consult and use, a thermophilic protein prediction system based on the research results of this thesis was developed with Python and PyQt5. It not only predicts thermophilic proteins but also visualizes research steps such as feature extraction, data reduction, machine learning modeling, and performance analysis. The stability of the system was verified through a series of performance tests, so that it can provide technical services to more protein prediction researchers.
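
As an illustration of one of the sequence features named in the abstract, the following is a minimal sketch of g-gap dipeptide composition. The abstract does not give the formula, so this assumes the commonly used definition: the frequency of each ordered residue pair separated by g intervening positions, giving a 400-dimensional vector. The function name, the 20-letter alphabet ordering, and the choice to skip pairs containing non-standard residues are assumptions made for this example, not details taken from the thesis.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes); ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs in a fixed order, so vectors are comparable.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def g_gap_dipeptide_composition(sequence, g=0):
    """Return a 400-dimensional g-gap dipeptide frequency vector.

    A g-gap dipeptide pairs the residue at position i with the residue
    at position i + g + 1; g = 0 gives ordinary adjacent dipeptides.
    """
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        idx = INDEX.get(pair)
        if idx is None:
            # Assumption: ignore pairs that contain non-standard residues.
            continue
        counts[idx] += 1
        total += 1
    if total == 0:
        # Sequence too short for this gap, or no valid pairs at all.
        return [0.0] * len(DIPEPTIDES)
    # Normalize by the number of counted pairs; this equals L - g - 1
    # when every residue in the sequence is a standard amino acid.
    return [c / total for c in counts]


# Hypothetical example sequence, used only to show the call.
features = g_gap_dipeptide_composition("MKVLAAGIVKELREKT", g=1)
print(len(features))
```

In a pipeline like the one the abstract describes, vectors of this kind would be computed for each sequence (possibly for several values of g) and then passed to dimensionality reduction and a classifier; the best value of g is a parameter that would have to be tuned experimentally.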
Keywords/Search Tags: Thermophilic proteins, Machine learning, Feature fusion, KPCA, Bi-layer cascade SVM, PyQt5