An Ensemble Approach To Protein Fold Classification By Integration Of Template-based Assignment And Support Vector Machine Classifier

Posted on: 2018-02-14  Degree: Master  Type: Thesis
Country: China  Candidate: J Q Xia
GTID: 2310330566450279  Subject: Biophysics
Abstract/Summary:
A protein is a linear sequence of the 20 standard amino acids, and its structure determines its function. Protein fold classification is a critical step in protein structure prediction. Roughly one thousand protein folds exist in nature. Research on fold recognition and the development of effective prediction algorithms not only help to reveal the inherent laws of protein folding but also have important biological significance for the study of protein structure. There are two possible ways to classify protein folds. The first is template-based fold assignment. When sequence similarity to a known template is high, template-based assignment predicts well, but its reliability drops sharply as similarity decreases. The second is ab-initio prediction using machine learning algorithms, which extracts structural features of the protein from its amino acid sequence and then predicts the fold type. Combining the two approaches to improve prediction accuracy has not previously been explored. This thesis explores that combination and obtains good results. We developed two algorithms for protein fold classification, HH-fold and SVM-fold. HH-fold is a template-based fold assignment algorithm that uses the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm in which a comprehensive set of features is extracted from three complementary sequence profiles. The two algorithms are then combined, resulting in the ensemble approach TA-fold. The proposed methods are evaluated and compared with both ab-initio and template-based threading methods on six benchmark datasets. TA-fold achieved an accuracy of 0.799 on the DD dataset, which consists of proteins from 27 folds. This represents an improvement of 5.4% to 11.7% over ab-initio methods. After this dataset was updated to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved an accuracy above 0.9 on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from three complementary sequence profiles that contain rich evolutionary information.
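
The abstract does not say how TA-fold combines the outputs of HH-fold and SVM-fold. One simple possibility, shown here purely as an illustrative sketch and not as the thesis's actual rule, is a confidence-gated fallback: trust the template-based assignment when its top hit is confident, and otherwise use the SVM prediction. The names `TemplateHit` and `ensemble_predict`, the `probability` field, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemplateHit:
    """Top template hit for a query (hypothetical container)."""
    fold: str           # fold label of the matched template
    probability: float  # confidence of the hit, assumed to lie in [0, 1]


def ensemble_predict(hit: Optional[TemplateHit],
                     svm_fold: str,
                     threshold: float = 0.5) -> str:
    """Return a fold label using a confidence-gated fallback.

    Assumption: a template hit at or above `threshold` is trusted;
    otherwise the ab-initio (SVM) prediction is used instead.
    """
    if hit is not None and hit.probability >= threshold:
        return hit.fold
    return svm_fold


# Confident template hit: the template-based assignment is used.
print(ensemble_predict(TemplateHit("TIM barrel", 0.95), "Ferredoxin-like"))
# Weak or missing template hit: fall back to the SVM prediction.
print(ensemble_predict(TemplateHit("TIM barrel", 0.10), "Ferredoxin-like"))
print(ensemble_predict(None, "Ferredoxin-like"))
```

A gate of this kind reflects the trade-off described in the abstract: template-based assignment is reliable at high sequence similarity and degrades as similarity falls, which is where a sequence-feature classifier can take over.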
Keywords/Search Tags: protein fold prediction, feature extraction, SVM