
Computational algorithms for spectral prediction and motif discovery in proteomic sequence data

Posted on: 2007-01-22
Degree: Ph.D
Type: Dissertation
University: Harvard University
Candidate: Schwartz, Daniel
Full Text: PDF
GTID: 1458390005990895
Subject: Molecular biology
Abstract/Summary:
As the products of the information age continue to permeate the biological realm, there is an ever-growing need for computational tools to keep pace. Here, two such tools are presented. Although largely unrelated in detail, both stem from a desire to better understand biological sequences through algorithms that harness the statistical power contained within large-scale proteomic data sets.

The first of these tools is aimed at predicting tandem mass spectral fragment ion intensities, with the goal of improving peptide sequencing. From a large database of confidently assigned doubly charged spectra, data were collected on the two residues surrounding each fragmentation site as well as on the relative position of the site by peptide mass. In addition to revealing never-before-visualized trends in tandem mass spectra, the results indicate that this information, used in conjunction with the outlined spectral prediction methodology, is sufficient to model fragment ion intensities with high accuracy given inherent spectral variability. Furthermore, to assess the likelihood of a sequence/spectrum identification, a scoring scheme based on the overlap of high-intensity peaks between actual and predicted spectra is described. The SPIIDR (Spectral Prediction of Ion Intensities using DiResidues) algorithm is available for public use at http://gygi.med.harvard.edu/spiidr/.

The second computational tool was initially aimed at the discovery of phosphorylation motifs from large-scale phosphoproteomic studies; however, its success at extracting overrepresented patterns from any sequence-based data set, including whole proteins and linguistic text, is also demonstrated in this work. To deconvolute a data set into its constitutive motifs, the algorithm uses a dynamic statistical background coupled to an iterative two-phase methodology based on recursive motif building and subsequent set reduction. The approach is validated on numerous positive control data sets and through the congruity of the extracted motifs with those discovered using orthogonal strategies. Furthermore, comparison of the algorithm with other widely used protein motif discovery tools, and its ability to extract previously known biologically significant motifs, highlight its success. Finally, an in-depth overview of the online embodiment of the methodology, known as motif-x (located at http://motif-x.med.harvard.edu), is provided.
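
The two-phase loop described for the second tool (recursive motif building, then set reduction) can be illustrated with a small sketch. This is not the motif-x implementation: the binomial tail test, the thresholds `alpha` and `min_count`, the function names, and the toy data are all assumptions made for illustration, and removing matched sequences from the background as well as the foreground is only one possible reading of "dynamic statistical background".

```python
from itertools import product
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); an assumed significance test."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def matches(motif, seq):
    """True if seq agrees with every fixed (non-'.') position of motif."""
    return all(m == "." or m == c for m, c in zip(motif, seq))

def build_motif(fg, bg, alpha, min_count):
    """Phase 1: repeatedly fix the most significant residue/position pair,
    narrowing both foreground and background to the matching sequences."""
    motif = ["."] * len(fg[0])
    while bg:  # stop if the background has been exhausted
        best = None
        for pos, fixed in enumerate(motif):
            if fixed != ".":
                continue
            for res in sorted({s[pos] for s in fg}):
                k = sum(s[pos] == res for s in fg)
                if k < min_count:
                    continue
                p_bg = sum(s[pos] == res for s in bg) / len(bg)
                pval = binom_tail(k, len(fg), p_bg)
                if pval < alpha and (best is None or pval < best[0]):
                    best = (pval, pos, res)
        if best is None:
            break
        _, pos, res = best
        motif[pos] = res
        fg = [s for s in fg if s[pos] == res]
        bg = [s for s in bg if s[pos] == res]
    return "".join(motif), len(fg)

def discover_motifs(fg, bg, alpha=0.01, min_count=3):
    """Phase 2: record each motif, remove matching sequences, and repeat."""
    motifs, fg, bg = [], list(fg), list(bg)
    while fg:
        motif, support = build_motif(fg, bg, alpha, min_count)
        if set(motif) == {"."}:  # nothing significant remains
            break
        motifs.append((motif, support))
        fg = [s for s in fg if not matches(motif, s)]
        bg = [s for s in bg if not matches(motif, s)]  # assumed reading of "dynamic background"
    return motifs

# Toy data (hypothetical): 5-residue windows centered on S.
alphabet = "ADEGKLPRTV"
background = [a + b + "S" + c + d for a, b, c, d in product(alphabet, repeat=4)]
foreground = [
    "AGSPK", "LESPD", "KTSPE", "DVSPG", "TKSPL", "EGSPT",
    "RASLE", "RGSVD", "RLSAK", "RTSGA",
]
motifs = discover_motifs(foreground, background)
print(motifs)
```

On this toy input the sketch prints `[('...P.', 6), ('R....', 4)]`. The second motif only becomes significant after the six sequences matching the first motif have been removed, which is the point of the set-reduction phase.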
Keywords/Search Tags: Spectral prediction, Motif, Data, Algorithm, Discovery