Mass spectrometry-based proteomic data analysis

Posted on: 2014-09-17
Degree: Ph.D
Type: Thesis
University: Hong Kong University of Science and Technology (Hong Kong)
Candidate: Yang, Chao
GTID: 2454390005995238
Subject: Engineering
Abstract/Summary:
Proteomics studies large-scale cellular functions directly at the protein level. In proteomics, mass spectrometry (MS) has been a primary tool for conducting high-throughput experiments. In a typical shotgun proteomic experiment, proteins are digested into peptides by enzymes and analyzed by a mass spectrometer. A complete liquid chromatography-mass spectrometry (LC-MS) dataset contains thousands of single-stage spectra (MS1) and tandem MS spectra (MS2), which correspond to ionized peptides and their fragments, respectively. Accurate, high-throughput qualitative and quantitative analysis of proteins from LC-MS data is a primary goal of proteomics.

A proteomic data analysis framework contains many intermediate steps. According to their objectives, they can be grouped into three major steps: preprocessing, peptide-level analysis and protein-level analysis. This thesis makes the following contributions to these three steps.

In the preprocessing step, we provide a survey and compare the performance of single spectrum-based peak detection methods. In general, a peak detection procedure can be decomposed into three consecutive parts: smoothing, baseline correction and peak finding. We first categorize existing peak detection algorithms according to the techniques used in each of these phases. Such a categorization reveals the differences and similarities among existing peak detection algorithms. Then, we choose five typical peak detection algorithms and conduct a comprehensive experimental study using both simulated data and real matrix-assisted laser desorption/ionization (MALDI) MS data. According to our study, the continuous wavelet transform-based method is the most effective one in practice.

In the peptide-level analysis step, we develop convex optimization models to perform peptide identification and peptide quantification. For peptide identification, we propose a new method named MIRanker. It uses information in the protein database and MS1 spectra to improve peptide identification results. According to our experiments on a standard protein mixture dataset, a human dataset and a mouse dataset, MIRanker achieves better peptide re-ranking results than existing methods, including PeptideProphet, PeptideProphet plus the number of sibling peptides, and the score regularization method SRPI. For peptide quantification, we propose to estimate peptide abundance by taking advantage of the peptide isotopic distribution and the smoothness of the peptide elution profile. Our method solves the peptide overlapping problem and provides a way to control the variance of the estimation.

In the protein-level analysis step, we develop a new protein identification method. It provides a combinatorial perspective on the protein inference problem by calculating conditional protein probabilities, where a protein probability is the probability that a protein is correctly identified. The probabilities are calculated under three assumptions, which lead to a lower bound, an upper bound and an empirical estimate of the protein probabilities, respectively. The combinatorial perspective enables us to obtain an analytical expression for protein inference. We also study the relationship between our model and other methods such as the one-hit rule, greedy algorithms, and the state-of-the-art method ProteinProphet. The proposed method can achieve better results than ProteinProphet while being much more efficient.
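The three-part decomposition of peak detection described for the preprocessing step (smoothing, baseline correction, peak finding) can be illustrated with a minimal sketch. It is not one of the five algorithms compared in the thesis and is not the continuous wavelet transform-based method. The techniques chosen for each phase (moving average, rolling minimum, thresholded local maxima) are illustrative assumptions, as are all function names, parameters and the synthetic spectrum.

```python
import math


def moving_average(y, window=5):
    """Smoothing phase: centered moving average (window shrinks at the edges)."""
    half = window // 2
    out = []
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out


def rolling_min_baseline(y, window=51):
    """Baseline phase: estimate the baseline as a rolling minimum."""
    half = window // 2
    return [min(y[max(0, i - half):min(len(y), i + half + 1)])
            for i in range(len(y))]


def find_local_maxima(y, min_height):
    """Peak-finding phase: local maxima at or above a height threshold."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > y[i - 1] and y[i] >= y[i + 1] and y[i] >= min_height]


def detect_peaks(intensities, smooth_window=5, baseline_window=51, min_height=1.0):
    """Run the three phases in order and return peak indices."""
    smoothed = moving_average(intensities, smooth_window)
    baseline = rolling_min_baseline(smoothed, baseline_window)
    corrected = [s - b for s, b in zip(smoothed, baseline)]
    return find_local_maxima(corrected, min_height)


def make_synthetic_spectrum(n=200):
    """Hypothetical test signal: drifting baseline plus two Gaussian peaks."""
    def gauss(i, center, height, sigma=3.0):
        return height * math.exp(-((i - center) ** 2) / (2 * sigma ** 2))
    return [2.0 + 0.01 * i + gauss(i, 60, 10.0) + gauss(i, 140, 6.0)
            for i in range(n)]


if __name__ == "__main__":
    print(detect_peaks(make_synthetic_spectrum()))
```

Keeping each phase as a separate function mirrors the categorization in the abstract: any one phase can be swapped for another technique (for example, a wavelet-based smoother) without touching the other two.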
Keywords/Search Tags: Protein, Mass, Proteomic, Data, Method, Peak detection, Peptide