Prediction of catalytic site in proteins using support vector machine classifier and conservation of prediction approach

Posted on: 2009-05-20
Degree: Ph.D
Type: Dissertation
University: Georgetown University Medical Center
Candidate: Petrova, Natalia V
Full Text: PDF
GTID: 1440390005459535
Subject: Biology
Abstract/Summary:
Background. The number of protein sequences deriving from various genome sequencing projects is outpacing our knowledge about the function of these proteins. With the gap between experimentally characterized and uncharacterized proteins continuing to widen, it is necessary to develop new computational methods and tools for functional prediction. Knowledge of catalytic sites provides valuable insight into protein function. Although many computational methods have been developed for catalytic site prediction, their accuracy remains low (in the 70% range), with a significant number of false positives. This research aims to develop a novel method for the prediction of protein catalytic sites based on an optimal discriminative set of sequence and structural properties, using a supervised machine learning algorithm and conservation analysis of protein evolutionary relationships.
Results. A benchmarking dataset of 79 enzymes with 254 catalytic residues was used to construct the predictive model and to analyze its performance by 10-fold cross-validation. The model was built using the best-performing machine learning algorithm among 26 tested, Sequential Minimal Optimization (SMO), a Support Vector Machine (SVM) training algorithm, coupled with an optimal subset of seven sequence and structural features selected by the Wrapper Subset Selection algorithm. The predictive model achieved a predictive accuracy of 86%, but with a large number of false positives. The results of the prediction were subsequently analyzed by an innovative approach, Conservation of Prediction (CoP), to capture catalytic sites conserved in protein families and domains, thus drastically reducing the number of false positives and increasing the predictive accuracy to 97.3%. Combining the predictive model and the CoP, a stand-alone Java program, ENDURANCE, was implemented for automated prediction and for analysis and visualization of the obtained results.
Conclusions. This research has developed a distinctive method and an automated tool for the large-scale prediction of catalytic sites in diverse protein families with high predictive accuracy and a low false positive rate. Careful analysis of large domain families, such as the alpha/beta hydrolases, further indicated that many non-catalytic sites predicted by ENDURANCE may represent functional "supersites" that potentially play important roles in enzyme function.
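The abstract does not say how the Conservation of Prediction step is computed, so the following is only a minimal illustrative sketch of the general idea, not the ENDURANCE implementation. It assumes that each protein's per-residue predictions have already been mapped onto the columns of a shared family alignment, and that a predicted site is kept only when the same column is also predicted in a sufficient fraction of family members. The function name, the input layout, and the `min_fraction` threshold are all hypothetical.

```python
from collections import defaultdict


def conservation_of_prediction(predictions, min_fraction=0.5):
    """Keep predicted sites whose alignment column is predicted in enough family members.

    predictions: dict mapping a protein id to the set of alignment column
                 indices that a per-residue classifier flagged as catalytic.
    min_fraction: hypothetical threshold (an assumption, not a value from
                  the dissertation) on the share of proteins that must
                  agree on a column for it to count as conserved.
    """
    n_proteins = len(predictions)
    if n_proteins == 0:
        return {}

    # Count how many family members were predicted positive at each column.
    column_votes = defaultdict(int)
    for columns in predictions.values():
        for col in columns:
            column_votes[col] += 1

    # A column is "conserved" when enough proteins agree on it.
    conserved = {
        col for col, votes in column_votes.items()
        if votes / n_proteins >= min_fraction
    }

    # Drop isolated predictions, which are the likely false positives.
    return {pid: cols & conserved for pid, cols in predictions.items()}


# Toy family of four aligned enzymes (made-up column indices).
family = {
    "enzyme_a": {12, 57, 103},
    "enzyme_b": {12, 57, 88},
    "enzyme_c": {12, 57},
    "enzyme_d": {12, 40},
}
filtered = conservation_of_prediction(family, min_fraction=0.75)
```

In this toy family, columns 12 and 57 are predicted in at least three of the four proteins and survive the filter, while columns 103, 88, and 40 are each predicted in only one protein and are discarded.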
Keywords/Search Tags:Catalytic, Protein, Prediction, Machine, Function, Conservation, Using