
Building and Searching Unidentified Tandem Mass Spectral Libraries for Biological Sample Fingerprinting

Posted on: 2014-09-08
Degree: Ph.D
Type: Thesis
University: Hong Kong University of Science and Technology (Hong Kong)
Candidate: Shao, Wenguang
Full Text: PDF
GTID: 2454390005986368
Subject: Engineering
Abstract/Summary:
Proteomics refers to the large-scale, high-throughput, system-wide study of proteins in a biological system. Among the many workflows used in proteomics, shotgun proteomics, which involves digesting proteins into shorter peptides, separating them by chromatography, and then resolving and fragmenting them in the mass spectrometer, has been the most popular and effective. The fragmentation patterns recorded in tandem mass (MS/MS) spectra can be used to deduce the peptide sequences, usually by a computational method called sequence database searching. This computationally intensive method relies on the genome sequence of the organism studied to enumerate all possible peptide candidates, and then scores them one by one to find the best match.

Traditionally, discovering all proteins and their isoforms has been the primary goal of proteomics. More recently, proteomics has seen a gradual transition to hypothesis-driven approaches, akin to DNA microarrays, for which the goal is to measure the same protein signals reproducibly and accurately. For this purpose, spectral reference libraries, which are compilations of previously observed MS/MS spectra, play an important role as information hubs, enabling researchers to store, merge, retrieve and share data. The main objective of this thesis is to develop the computational toolkit needed to extend the use of traditional spectral reference libraries to unidentified spectra, breaking free of the assumption that every spectrum must first be assigned to a peptide to be useful.

In the first part of this thesis, a novel method for denoising tandem mass spectra based on Bayesian inference is developed for spectral library building. It mainly aims to improve the quality of spectral libraries, especially for singleton spectra, where the traditional approach of merging multiple replicates of the same peptide ion into a consensus spectrum cannot be applied. As a result, spectra denoised by this method can retain more signal peaks, and perform better in searching, than those filtered by intensity alone.

In the second part of this thesis, a clustering algorithm for constructing a tandem mass spectral library from both identified and unidentified MS/MS spectra is developed. The resulting library can thus function as a complete record of the experimental data, allowing better data analysis and integration. Even in the absence of peptide identification, a properly compiled library of tandem mass spectra can function as a "fingerprint" for a biological sample.

In the third part of the thesis, the scoring function used in spectral library searching is redesigned to ameliorate some of its well-known shortcomings. The similarity score is transformed into a tail probability, which allows one to assign a statistical significance to every spectrum-spectrum match. This also enables one to forgo target-decoy approaches, which are not applicable without peptide identifications, and instead rely on parametric mixture-model fitting to estimate the posterior error probability and thereby the false discovery rate.

Finally, the methods developed are applied to one practical application that is intractable by traditional genome-obligated proteomics approaches. Our method is developed to identify the source of the blood meals of hematophagous arthropods, an enabling tool for the study of the ecology of infectious diseases in nature. This method, based on comparing the blood proteomes as recorded in unidentified spectral libraries, is sensitive, fast, cost-effective, evolutionarily accurate, and compares favorably to existing genome-based and single protein-based methods. In conclusion, unidentified spectral libraries can function as fingerprints for biological samples at the proteome level, and can be effectively utilized in applications such as species classification and microbial source tracking.
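
To make the spectrum-spectrum matching and error-rate ideas above more concrete, the following is a minimal Python sketch and not the thesis's actual implementation. It assumes a normalized dot product (cosine) over binned peaks as the similarity score, which is a common choice in spectral library searching but is not stated in the abstract, and it uses the standard relation that the false discovery rate of a set of accepted matches can be estimated as the mean of their posterior error probabilities. The function names, the bin width, and the example values are hypothetical; the tail-probability transform and the mixture-model fit themselves are not shown.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    `peaks` is a list of (m/z, intensity) pairs. The 1.0 bin width is an
    illustrative assumption, not a value taken from the thesis.
    """
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b):
    """Normalized dot product between two binned spectra (dicts bin -> intensity)."""
    dot = sum(v * spec_b.get(k, 0.0) for k, v in spec_a.items())
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


def fdr_from_peps(peps):
    """Estimate the FDR of a set of accepted matches as their mean
    posterior error probability (PEP)."""
    if not peps:
        return 0.0
    return sum(peps) / len(peps)


if __name__ == "__main__":
    # Hypothetical query and library spectra as (m/z, intensity) pairs.
    query = bin_spectrum([(100.2, 10.0), (250.7, 5.0), (400.1, 2.0)])
    library = bin_spectrum([(100.4, 9.0), (250.9, 6.0), (512.3, 1.0)])
    print("similarity:", cosine_similarity(query, library))
    # Hypothetical PEPs for three accepted spectrum-spectrum matches.
    print("estimated FDR:", fdr_from_peps([0.01, 0.05, 0.20]))
```

In a fuller pipeline under these assumptions, each similarity score would first be converted to a tail probability, and a mixture model fitted to the resulting score distribution would supply the PEP values passed to `fdr_from_peps`.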
Keywords/Search Tags:Tandem mass, Spectral, Biological, Unidentified, Searching, Proteomics