Investigation And Application Of The Quality Control For Peptide Identification In Shotgun Proteomics

Posted on: 2011-04-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Ma
Full Text: PDF
GTID: 1100360308974866
Subject: Biochemistry and Molecular Biology

Abstract/Summary:
With the completion of the Human Genome Project, scientists turned to a comprehensive analysis of gene products, the proteins encoded by the genome, in order to explore the essential nature and laws of life. Proteomics has become one of the most active areas of life science research in the post-genomic era. The development of mass spectrometry has provided a high-throughput, high-sensitivity and high-resolution analysis platform for proteomics. Tandem mass spectrometry is now one of the most powerful technologies for protein identification, and it makes global protein profiling possible.

Tandem mass spectrometry combined with a database search strategy allows high-throughput identification of peptides and proteins in shotgun proteomics. However, it cannot solve the problem of protein identification completely, given the diversity of biological samples, the complexity of the experimental process and the limitations of existing search algorithms. The problems of applying a database search strategy to mass spectrometry data can be summarized in two points: how to ensure the completeness of the identified results, and how to ensure their correctness. Thus, selecting all the peptide-spectrum assignments that are actually correct is one of the most daunting tasks in mass spectrometry-based proteomics.

In this study, we focus on improving the quality control procedure for peptide identification in shotgun proteomics; the primary aim is to distinguish correct from incorrect matches effectively. The negative (incorrect) results among peptides identified from tandem mass data are mainly caused by ambiguous identifications and random identifications in the database search strategy. The challenges facing the quality control procedure in mass data analysis are as follows.

1. The analysis of digested proteins by mass spectrometry is a complex physical and chemical process.
The database search results are likely to be affected by many factors, such as sample complexity, sequence databases, experimental protocols and types of instrumentation. Taking advantage of additional features would provide a means of improving the sensitivity of filtration methods.
2. It is necessary to establish an objective evaluation system that not only takes into account the specific data set as a whole, but also reflects the "personality" of each identified peptide by giving experimenters a confidence level for each individual peptide or protein identification.
3. Robust filter methods and models that can effectively analyze data from multiple sources are urgently needed.
4. Mass spectrometers that provide high-accuracy data are increasingly used in proteomic studies. Using accurate mass measurements in the data analysis strategy is becoming a trend in proteomics applications.

In this work, on the basis of the target-decoy database search strategy, we conducted a comprehensive investigation of the two kinds of identifications that contribute to negative hits. The research focused on improving the validation process for identified peptides, which involved improving the sensitivity, specificity and generalizability of the filter methods.

First, we evaluated the patterns and frequencies of ambiguous matches occurring in database search outputs, using standard data sets, theoretically simulated spectra and real sample data. We also conducted an in-depth study of how different mass error tolerance (MET) settings in the database search affected the occurrence of ambiguous matches. The observations indicated that the peptide MET was the main factor determining the number of ambiguous matches. Ambiguous matches are one of the factors that affect the calculated false positive rate of standard protein data sets, and this can be improved by searching a database composed of low-homology sequences.
If the ambiguous matches of the same spectrum belong to different proteins, we recommend reporting all the peptides as a peptide group and choosing the protein that is supported by other peptide identifications.

Then, we presented and evaluated filter methods for the peptide validation procedure, specifically for high-accuracy mass data and for the two most commonly used search engines, SEQUEST and Mascot.

The hybrid linear trap quadrupole Fourier-transform ion cyclotron resonance mass spectrometer (LTQ-FT), an instrument with high accuracy and resolution, is widely used in the identification and quantification of peptides and proteins. However, time-dependent errors in the system may degrade the accuracy of these instruments, which negatively influences the determination of the MET in database searches. We investigated the parent ion mass error distribution of the LTQ-FT mass spectrometer and applied an improved recalibration procedure to determine the statistical MET of different data sets. Based on the improved recalibration formula, we introduced a new tool, FTDR (Fourier-transform data recalibration), that provides a graphical user interface (GUI) for automatic calibration. We then presented a new strategy, LDSF (large-MET database search and small-MET filtration), for specifying the database search MET and validating database search results. As the name implies, a large-MET database search is conducted and the search results are then filtered using the statistical MET estimated from high-confidence results.
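
The LDSF idea described above can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the record layout, the score cutoff used to pick "high-confidence" matches, and the choice of mean ± k standard deviations as the statistical MET window are all assumptions made for illustration, and the recalibration step performed by FTDR is not modeled.

```python
from statistics import mean, stdev


def ppm_error(observed_mass, theoretical_mass):
    """Parent ion mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def ldsf_filter(psms, score_cutoff, k=3.0):
    """Filter wide-tolerance search results with a narrow, data-driven window.

    psms: list of dicts with hypothetical keys 'score' and 'ppm'.
    score_cutoff: matches scoring at or above this value are treated as
        high-confidence (an assumed rule, not the source's criterion).
    k: half-width of the window in standard deviations (assumed).
    """
    trusted = [p["ppm"] for p in psms if p["score"] >= score_cutoff]
    if len(trusted) < 2:
        raise ValueError("need at least two high-confidence matches")
    center = mean(trusted)
    spread = stdev(trusted)
    low, high = center - k * spread, center + k * spread
    kept = [p for p in psms if low <= p["ppm"] <= high]
    return kept, (low, high)


# Toy results from a hypothetical wide-tolerance search.
example_psms = [
    {"score": 4.1, "ppm": 1.0},
    {"score": 3.9, "ppm": 2.0},
    {"score": 3.8, "ppm": 3.0},
    {"score": 3.7, "ppm": 2.0},
    {"score": 1.5, "ppm": 2.5},    # modest score, mass error fits the window
    {"score": 1.4, "ppm": 18.0},   # modest score, mass error far outside
    {"score": 1.2, "ppm": -12.0},  # modest score, mass error far outside
]

kept, window = ldsf_filter(example_psms, score_cutoff=3.5)
```

In this toy data the window is estimated only from the four high-scoring matches, so the lower-scoring match whose mass error falls inside the window is retained, while the two with large mass errors are dropped.
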
By applying this strategy to both a standard protein data set and a complex data set, we demonstrated that LDSF can significantly improve the sensitivity of the result validation procedure.

A Bayesian nonparametric (BNP) model was developed to improve the validation of database search results for SEQUEST. It incorporates several popular techniques, including the linear discriminant function (LDF), the flexible nonparametric probability density function (PDF) and the Bayesian method. The BNP model is naturally compatible with the popular target-decoy database search strategy. We tested the BNP model on standard-protein and real complex-sample data sets from multiple MS platforms (LCQ, LTQ and LTQ-FT) and compared it with the cutoff-based method, PeptideProphet and a simple nonparametric method. The BNP model showed superior sensitivity and generalizability on all data sets searched. Some high-quality matches that had been filtered out by other methods were detected and assigned high probabilities by the BNP model. Thus, the BNP model can validate database search results effectively and extract more information from MS/MS data.

The probability-based search engine Mascot has been widely used to identify peptides and proteins in shotgun proteomic research. Most subsequent quality control methods filter out ambiguous assignments according to the ion score and threshold provided by Mascot. On the basis of the target-decoy database search strategy, we evaluated the performance of several filter methods on Mascot search results and demonstrated that using filter boundaries in a two-dimensional feature space, defined by the Mascot ion score and its relative score, can improve the sensitivity of the filter process.
Furthermore, using a linear combination of several characteristics of the assigned peptides, including the Mascot score, 23 previously employed features and three newly introduced features, we applied the Bayesian nonparametric model to Mascot search results and validated more correctly identified peptides in control and complex data sets than could be validated by empirical score thresholds, the cutoff-based method or the linear discriminant model.

With the rapid development of the Human Proteome Project, experimental instruments and techniques have made great progress. However, a huge amount of heterogeneous data has been generated by different laboratories using diverse analytical strategies. To integrate these multi-source data, we built a unified quality control procedure for large-scale mass spectrometry data on the basis of the Bayesian nonparametric model. Using this strategy, we reprocessed the mouse liver organelle expression data set of the Chinese Human Liver Proteome Project and greatly improved the peptide and protein identifications.

Making use of available information that is typically ignored can benefit the data analysis process in proteomics. Compared with early studies, in which only a few characteristics were used in mass data classifiers, more and more features are being brought into mass spectrum data mining. Combining new features with an appropriate framework plays an important role in obtaining good results. On the basis of these concepts, we carried out several exploratory studies on the application of computational and statistical methods to high-throughput MS/MS data analysis, with the aim of improving quality control for peptide identification in shotgun proteomics.
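
The target-decoy strategy that underlies the filter methods above can be sketched as follows. The abstract does not state which false discovery rate (FDR) formula was used, so this sketch assumes the common decoys-over-targets estimate for a concatenated target-decoy search; the function names and toy data are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among matches scoring at or above the threshold.

    psms: list of (score, is_decoy) tuples.
    Uses decoys / targets, one common estimate for concatenated
    target-decoy searches (an assumption, not the source's formula).
    """
    targets = sum(1 for s, is_decoy in psms if s >= threshold and not is_decoy)
    decoys = sum(1 for s, is_decoy in psms if s >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def lowest_threshold_at_fdr(psms, max_fdr):
    """Return the lowest score threshold whose estimated FDR is within max_fdr."""
    best = None
    for t in sorted({s for s, _ in psms}, reverse=True):
        if estimate_fdr(psms, t) <= max_fdr:
            best = t
    return best


# Toy search results: (score, matched a decoy sequence?)
toy_psms = [
    (9.0, False), (8.0, False), (7.0, False), (6.0, True),
    (5.0, False), (4.0, True), (3.0, True),
]

cutoff = lowest_threshold_at_fdr(toy_psms, max_fdr=0.25)
```

A single score cutoff like this corresponds to the simplest kind of filter discussed above; filters over several features follow the same logic but estimate the error rate inside a multi-dimensional boundary instead of above one threshold.
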
Keywords/Search Tags: Proteomics, Bioinformatics, Database search strategy, Quality control