
Transmembrane Protein Prediction using Support Vector Machines

Posted on: 2011-06-11
Degree: M.S.
Type: Thesis
University: University of California, Irvine
Candidate: Shivaram, Kiran
Full Text: PDF
GTID: 2440390002457312
Subject: Biology
Abstract/Summary:
Transmembrane (TM) proteins are an important family of proteins responsible for key biological functions. Understanding their structure and properties has proved challenging because they are poor targets for experimental study. Motivated by this limitation, several previous studies have sought to predict and model both alpha-helical transmembrane (ahtm) proteins and beta-barrel transmembrane (bbtm) proteins. To date, the majority of these studies, especially those relating to bbtm proteins, have been based on relatively small datasets consisting of at most 400-2000 protein sequences. As protein databases expand rapidly with the sequencing of new proteins, novel and robust methods that pass rigorous testing on larger data are necessary to identify new TM proteins consistently and efficiently.

Using a benchmark dataset of approximately 1900 proteins, we extract and evaluate 218 features, including 116 novel features, which are then used to train a 3-class support vector machine (SVM) to discriminate ahtm and bbtm proteins from non-transmembrane (ntm) proteins. The resulting predictor, ABTMPro, shows a relative improvement when compared directly with existing methods using standard evaluation metrics. It achieves an accuracy of over 98% and Matthews correlation coefficient (MCC) values as high as 0.939 and 0.945 for ahtm and bbtm proteins, respectively, estimated using multiple 10-fold cross-validation runs.

Next, we construct a larger, less redundant dataset of over 10,000 proteins, which are randomly split into a training set and a test set. From the training data, we extract the same set of features and train a new 3-class SVM, which is then used to make predictions on the test set. Although we achieve a lower accuracy of 97% and lower MCC values of 0.856 and 0.636 for ahtm and bbtm proteins, respectively, than on the smaller dataset, we still significantly outperform existing methods on the same test set.
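
The abstract reports overall accuracy together with a separate MCC value for each transmembrane class. As a rough illustration only, the sketch below computes a per-class MCC by treating one class as positive and the other two as negative. This one-vs-rest reading is an assumption, since the abstract does not state how the per-class values were derived. The function names and the toy labels are hypothetical and are not taken from the thesis.

```python
from math import sqrt


def mcc_one_vs_rest(y_true, y_pred, positive):
    """Matthews correlation coefficient for one class.

    Assumption: the class `positive` is scored against all other
    classes pooled together as negatives (one-vs-rest).
    """
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(y_true, y_pred):
    """Fraction of sequences whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, invented purely for illustration (not data from the thesis).
y_true = ["ahtm", "ahtm", "bbtm", "ntm", "ntm", "ntm"]
y_pred = ["ahtm", "ntm", "bbtm", "ntm", "ntm", "ahtm"]

print("accuracy:", accuracy(y_true, y_pred))
for cls in ("ahtm", "bbtm"):
    print(cls, "MCC:", mcc_one_vs_rest(y_true, y_pred, cls))
```

Because ntm proteins typically far outnumber bbtm proteins, overall accuracy can stay high while the MCC of a rare class drops sharply, which is one plausible reading of the 97% accuracy alongside a bbtm MCC of 0.636 on the larger dataset.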
Keywords/Search Tags: Proteins, Transmembrane, Test set, Using