Studies On Several Key Issues Of Mass Spectrometry Data Processing In Proteomics

Posted on: 2012-06-28  Degree: Doctor  Type: Dissertation
Country: China  Candidate: H C Sun
GTID: 1118330362460457  Subject: Control Science and Engineering
Abstract/Summary:
Mass spectrometry (MS) has become a powerful tool for modern proteomics studies. MS data are the fundamental resource for mining proteome information, and MS data processing is the core of computational proteomics. Interpreting complex peptide information from simple physical signals, and then applying that information to specific biological problems, remains a major challenge for proteome research. In this dissertation, we mainly focused on four key issues in MS data processing:

(1) Quality assessment of tandem mass spectrometry (MS/MS) spectra. High-throughput proteomics experiments usually produce large amounts of MS/MS data, but the majority of these spectra are of too low quality to be useful. Filtering out low-quality MS/MS spectra is one strategy for speeding up database searching. To obtain more accurate features, we first adjusted the estimate of peptide length using linear regression. We then presented a new method for MS/MS quality assessment and trained the classification model for 4 kinds of MS/MS data. To evaluate the method, we used 2 datasets from different laboratories to test its accuracy and reproducibility, and compared it with msmsEval. The results showed that our method achieves higher accuracy and better performance than msmsEval. Based on the algorithm, we developed a software tool for MS/MS quality assessment, named VSMS.

(2) Peptide sequence tag generation. Peptide sequence tagging (PST) interprets partial peptide sequences directly from MS/MS spectra. It is a flexible method that has been widely used in protein identification and in mapping post-translational modifications (PTMs) to peptides. We presented a new method for peptide sequence tagging from doubly charged tandem mass spectra and designed a new scoring function for it. We used 2 public datasets from different types of mass spectrometers (Thermo LTQ-FT, LTQ and LCQ, Waters/Micromass QTOF) to assess our method, and compared it with the widely used PST algorithm InsPecT. The results showed that our method achieves higher accuracy and better performance than InsPecT. Based on the algorithm, we developed a software tool for peptide sequence tagging, named TVNovoTag.

(3) De novo peptide sequencing. De novo peptide sequencing, which is usually used to discover new peptides and proteins, is one of the most challenging topics in computational proteomics. We presented a novel method based on virtual database searching to improve the performance of de novo sequencing for doubly charged spectra from high-resolution LTQ-FT mass spectrometry. We employed 2 datasets from different laboratories to assess our method and compared it with 2 widely used de novo sequencing algorithms, PepNovo and NovoHMM. The results showed that our method performs better on most of the indices. Based on the algorithm, we developed a software tool for de novo peptide sequencing, named TVNovo.

(4) Analysis of stable isotope labeling MS data. With advances in experimental technologies, large-scale protein quantification is widely applied to sample analysis in proteomics. We presented a least-squares fitting algorithm with nonlinear optimization as a universal quantification scheme for single MS spectra from different isotopic labeling techniques, and designed a compact indexed MS raw data file (IdxRaw) to accelerate the extraction of quantitative information. We used 3 datasets with different isotopic labeling to assess our method, and discussed some important issues in the development of quantitative algorithms. The results showed that our method supports fast random access to MS spectra, automatic decomposition of overlapped isotopic clusters, and multiple labeling techniques. Based on the algorithm, we developed an effective tool for stable isotope labeling quantification with accelerated calculation speed, named SILVER.

The development of basic bioinformatics software tools is also an important mission. In this dissertation, we also presented two novel software tools, Tmod and PNmerger.

Toolbox of Motif Discovery (Tmod) is a software tool for Windows operating systems. The current version of Tmod integrates 12 widely used motif discovery programs. Tmod provides a unified interface that makes these programs easier to use and helps users understand the tuning parameters. It allows plug-in motif-finding programs to run either separately or in batch mode with predetermined parameters, and provides a summary report that combines the outputs from multiple programs. Tmod can also be easily expanded to include future algorithms.

For a protein interaction network, PNmerger can automatically annotate the network proteins with pathway information extracted from KEGG, find the known pathway elements in the protein network, and predict possible pathway elements. To present the pathway information for the protein network, PNmerger illustrates the clusters of nodes that share the same biological pathway, and also presents the potential crosstalk elements between different pathways. This information helps users find important clues for knowledge discovery and experimental design.
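
As an illustration of the general idea behind peptide sequence tagging in item (2), the following is a minimal sketch and not the TVNovoTag algorithm. It assumes a ladder of singly charged fragment peaks and simply chains peaks whose m/z gaps match standard monoisotopic amino acid residue masses. The function name `find_tags`, the parameters `tag_len` and `tol`, and the example peak list are hypothetical; the dissertation's handling of doubly charged spectra and its scoring function are not reproduced here.

```python
# Standard monoisotopic residue masses in daltons.
# Isoleucine is omitted because it has the same mass as leucine (L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
MAX_RESIDUE = max(RESIDUE_MASS.values())


def find_tags(peaks, tag_len=3, tol=0.02):
    """Return (tag, start_mz) pairs for chains of tag_len + 1 peaks
    whose consecutive m/z gaps each match one residue mass within tol.

    Assumption for this sketch: all peaks are singly charged, and
    every matching chain is reported without any scoring.
    """
    peaks = sorted(peaks)

    def extend(idx, tag):
        # Stop once the tag has the requested number of residues.
        if len(tag) == tag_len:
            yield tag
            return
        for j in range(idx + 1, len(peaks)):
            gap = peaks[j] - peaks[idx]
            # Peaks are sorted, so later gaps can only be larger.
            if gap > MAX_RESIDUE + tol:
                break
            for residue, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tol:
                    yield from extend(j, tag + residue)

    results = []
    for i in range(len(peaks)):
        for tag in extend(i, ""):
            results.append((tag, peaks[i]))
    return results


# Hypothetical peak list: a P-V-T ladder starting at 200.0 plus one noise peak.
example_peaks = [200.0, 297.05276, 350.3, 396.12117, 497.16885]
tags = find_tags(example_peaks, tag_len=3)
```

A practical tagger would also rank candidate tags, for example by peak intensity and mass error, which is where a scoring function such as the one described in the abstract would come in.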
Keywords/Search Tags: Bioinformatics, Mass spectrometry, Data Processing, Spectra Quality Assessment, Peptide Sequence Tagging, De novo sequencing, Stable Isotope Quantification, Software Development