The Research On The Quality Control Methods Of Database Search Results Of Tandem Mass Spectrometry Data In Proteomics

Posted on: 2008-03-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Y Zhang
GTID: 1118360242999221
Subject: Control theory and control engineering
Abstract/Summary:
Proteomics aims to systematically investigate proteins, the functional molecules of life, at a global level. Because the dynamic range of protein expression in a biological system may exceed six orders of magnitude and the physical and chemical properties of proteins are highly complex, proteomic research needs high-throughput, highly sensitive experimental platforms. Biological mass spectrometry (MS) has these characteristics and has therefore become a supporting technique for proteomic research. Because of the complexity of the samples and of the chemical and physical mechanisms of the MS experiment, MS data contain complex noise, and MS data processing remains an open, active and difficult problem in proteomics. Database searching is a popular method of MS data processing: it compares each experimental mass spectrum with the predicted spectra of digested peptides from a target protein sequence database and finds the best matches according to scores that measure match quality. A database search result (also called a peptide identification) is the best match within a limited search space and is not necessarily correct. Because of the heavy computational burden, automatic database search software interprets the mass spectra only roughly and offers no effective method for evaluating the confidence of the resulting matches. The quality control problem for mass spectrometry data is therefore notable in the following areas: (1) Integrating MS data from multiple laboratories and multiple platforms is common practice in proteomic research, so a universal quality control framework is needed for large-scale proteomic research. (2) It is difficult to build a probability model based on the complex physical model of the MS experiment.
Many models used in the quality control of peptide identifications were obtained by observation, statistical fitting or training on standard datasets, so the universality of these models is doubtful and validating the results they give is laborious. (3) One reason for the complexity of MS data is that their statistical characteristics may change with the experimental conditions, environmental factors and samples, which makes it very difficult to build universal algorithms for MS data processing. (4) The various chemical and physical mechanisms involved in the MS experiment lead to many subclasses in MS data, so it is difficult to model the database search problem with a one-size-fits-all algorithm. Hence, multiple parameters are used to validate database search results. These parameters measure different aspects of the match quality between mass spectra and peptides. Quality control of peptide identifications therefore requires the integration and fusion of multi-source information and combined decision-making. (5) The huge data volume in proteomic research poses notable computational problems. This dissertation addresses these problems in database search result validation. It focuses on the optimization of some database search parameters, the extraction and selection of features for classifying correct and random peptide identifications, and algorithms and schemes for evaluating peptide identifications based on randomized database searching. The main work includes:
(1) The optimization of some database search parameters. Database searching is the basis of the quality control of peptide identifications. Many parameters need to be specified by the user before a database search. Some of them restrict the candidate peptides of a mass spectrum and greatly affect the database search results.
These parameters depend on the characteristics of the instrument and the physical and chemical theory of the experiment, and can be affected by the working status of the instrument, the experimental protocol and the complexity of the sample. In many studies, these parameters are set to the values recommended by the instrument manufacturer or in the literature. Statistical conclusions about their optimal values, which should be based on the user's own experimental data, are lacking. In fact, many database search parameters can be estimated from the results of an exploratory database search. In addition, many reference datasets with strict experimental designs have been published and can be used to analyze and optimize the database search parameters. In this dissertation, the influence on database search results of the parent ion mass error, the fragment ion m/z error and the enzyme specificity was investigated using reference datasets and statistical methods. A robust method was proposed to estimate the parent ion mass error tolerance and the fragment ion m/z error tolerance from noisy data. An improved recalibration law was proposed for high-accuracy Fourier-transform mass spectrometry, based on the observation that the mass error increases with retention time. The m/z error of the fragment ions was found to decrease with the signal intensity of the ions, and an empirical formula is provided to determine the m/z error tolerance from the signal intensity. The distribution of the number of missed cleavage sites in correct peptide identifications and the distribution of the number of peptide identifications with different tryptic termini were also analyzed. Based on the work in this section, we proposed a database search strategy that first enlarges the parent mass error tolerance actually used in the database search and then filters the results using the statistically estimated parent mass error tolerance.
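The search-wide-then-filter idea can be illustrated with a minimal sketch. It is not the dissertation's estimator, which the abstract does not specify: the sketch assumes that parent mass errors are expressed in ppm, that the tolerance window is the median plus or minus k scaled median absolute deviations (MAD), and that k = 3. The function names, the `error_ppm` field and the sample values are all hypothetical.

```python
from statistics import median


def robust_tolerance(errors_ppm, k=3.0):
    """Estimate a parent mass error window from noisy errors.

    Assumption: median +/- k * 1.4826 * MAD, a common robust
    stand-in for mean +/- k * sigma. The abstract does not say
    which robust estimator the author used.
    """
    center = median(errors_ppm)
    mad = median([abs(e - center) for e in errors_ppm])
    sigma = 1.4826 * mad  # MAD scaled to match sigma for normal data
    return center - k * sigma, center + k * sigma


def filter_by_tolerance(psms, low, high):
    """Keep identifications whose parent mass error lies in [low, high]."""
    return [p for p in psms if low <= p["error_ppm"] <= high]


# Hypothetical results of a search run with a deliberately wide
# parent mass tolerance (for example +/- 50 ppm).
psms = [
    {"peptide": "PEPTIDEA", "error_ppm": 1.8},
    {"peptide": "PEPTIDEB", "error_ppm": 2.1},
    {"peptide": "PEPTIDEC", "error_ppm": 2.4},
    {"peptide": "PEPTIDED", "error_ppm": 1.9},
    {"peptide": "PEPTIDEE", "error_ppm": 2.2},
    {"peptide": "PEPTIDEF", "error_ppm": 2.0},
    {"peptide": "RANDOMAA", "error_ppm": -45.0},  # likely random match
    {"peptide": "RANDOMBB", "error_ppm": 38.0},   # likely random match
]

low, high = robust_tolerance([p["error_ppm"] for p in psms])
kept = filter_by_tolerance(psms, low, high)
print(f"window: [{low:.2f}, {high:.2f}] ppm, kept {len(kept)} of {len(psms)}")
```

Because the median and MAD are barely moved by the two outlying matches, the estimated window stays centered on the systematic offset of about 2 ppm, and the filter discards the matches that only a wide search tolerance allowed.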
This strategy was applied to a control dataset, and the results showed that it could improve the discriminant power of the database scores.
(2) Feature extraction and selection for the quality control of database search results. The quality control of database search results is a typical pattern classification problem, and feature extraction and selection are essential to pattern classification. This dissertation summarizes the parameters used in the quality control of database search results, which include the database scores, the basic characteristics of the mass spectrum and the peptide, and empirical parameters proposed in the literature. It then introduces the generation of theoretical MS/MS spectra and the measurement of the discriminant power of these features. The discriminant power of some features was optimized based on background knowledge and exploratory data analysis. In addition, some practical problems in applying peptide retention time to the validation of peptide identifications were discussed and resolved. A set of features proposed in the literature was summarized and defined. Finally, based on background knowledge and clustering analysis, correlation analysis was performed on these features, and basic rules were provided for feature selection in the different database search result validation methods used in this dissertation.
(3) Validation of peptide identifications based on randomized database searching. Currently, methods based on randomized database searching can provide a universal framework for the quality control of MS data from different samples, platforms, experimental conditions and database search software. However, many practical problems with these methods have not been adequately solved, and research evaluating their performance is still preliminary.
This dissertation proposed a method for constructing a randomized database that avoids the shared-peptide problem. Four methods were then proposed to validate database search results: a method based on a linear discriminant function, a method based on fitting the marginal distributions of ln(Xcorr) and (ΔCn)^(1/2), a method based on multivariate nonparametric density estimation, and a method based on a Bayesian nonparametric model. These efforts aimed to provide solutions for the discriminant function and feature fusion in methods based on randomized database searching, and thus to improve the sensitivity of database search result validation. The linear discriminant function method is easy to use and has been applied to the Human Liver Proteome Project (HLPP). The ln(Xcorr) and (ΔCn)^(1/2) marginal distribution fitting method gave almost the same results as the linear discriminant function method. The other two methods use more features, and their sensitivity is considerably higher. These methods were evaluated on control datasets and real sample datasets and were shown to be more sensitive than traditional methods based on randomized database searching. In addition, the false positive rate estimate was shown to be sufficiently accurate on the control dataset. We also compared the performance of the randomized database searching method with PeptideProphet and found that the randomized database searching method performed better on datasets from different instruments and laboratories. The generalization performance of the randomized database searching method was thereby improved.
In summary, this dissertation revealed a series of problems in the quality control of tandem mass spectrometry data by applying statistical analysis to the huge datasets of proteomic research, which have varying statistics and inherently contain complex noise.
Consequently, this work provides a systematic study of the optimization of some database search parameters, the extraction and selection of features for classifying correct and random peptide identifications, and algorithms and schemes for evaluating peptide identifications based on randomized database searching. The proposed methods, which are based on multi-source feature fusion and practical nonparametric techniques, can substantially improve the sensitivity of peptide identification validation and cope with the variation among datasets. They have been applied in the HLPP.
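The validation scheme based on randomized database searching can be sketched as follows. This is a hedged illustration, not the dissertation's implementation. It assumes that each identification carries Xcorr and ΔCn scores and a flag marking matches to the randomized database, that the linear discriminant uses ln(Xcorr) and the square root of ΔCn with placeholder weights of 1.0 (the fitted weights are not given in the abstract), and that the false positive rate is estimated as the ratio of randomized to target matches above a threshold, one of several estimators in common use. All names and values are hypothetical.

```python
import math


def discriminant_score(xcorr, delta_cn, w_xcorr=1.0, w_dcn=1.0):
    """Linear discriminant on ln(Xcorr) and sqrt(DeltaCn).

    The weights are placeholders; in practice they would be
    fitted to the data.
    """
    return w_xcorr * math.log(xcorr) + w_dcn * math.sqrt(delta_cn)


def estimate_fdr(psms, threshold):
    """Estimate the false positive rate among accepted target matches.

    Assumption: (# randomized matches) / (# target matches) with
    score >= threshold.
    """
    accepted = [p for p in psms if p["score"] >= threshold]
    decoys = sum(1 for p in accepted if p["is_decoy"])
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest score threshold whose estimated rate is <= max_fdr."""
    best = None
    for t in sorted({p["score"] for p in psms}, reverse=True):
        if estimate_fdr(psms, t) <= max_fdr:
            best = t
    return best


# Hypothetical (Xcorr, DeltaCn, matched randomized database?) triples.
raw = [
    (4.0, 0.40, False), (3.5, 0.35, False), (3.0, 0.30, False),
    (2.5, 0.25, False), (1.2, 0.05, False),
    (1.5, 0.04, True), (1.1, 0.02, True), (1.3, 0.06, True),
]
psms = [
    {"score": discriminant_score(x, d), "is_decoy": decoy}
    for x, d, decoy in raw
]

threshold = threshold_for_fdr(psms, max_fdr=0.05)
accepted = [p for p in psms if p["score"] >= threshold and not p["is_decoy"]]
print(f"threshold {threshold:.3f} accepts {len(accepted)} target matches")
```

Because the threshold is derived from the randomized matches in the same search, the same procedure applies unchanged to data from a different instrument or laboratory, which is the property the abstract highlights.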
Keywords/Search Tags:database searching, data quality control, tandem mass spectrometry data analysis, false positive rate, nonparametric density estimation, Bayesian classification, feature extraction, linear discriminant function