
On Preprocessing Of Tandem Mass Spectra For Protein Identification

Posted on: 2007-11-16; Degree: Doctor; Type: Dissertation
Country: China; Candidate: J F Zhang; Full Text: PDF
GTID: 1118360185454196; Subject: Computer application technology
Abstract/Summary:
Identifying proteins by identifying their peptide sequences (peptide sequencing) from tandem mass spectra is a well-established method. During an experiment, the peptides separated by liquid chromatography are ionized and then fragmented by collision-induced dissociation (CID), and the resulting ions are measured by a mass spectrometer as mass-to-charge ratios (m/z). The peptides can then be identified from the m/z values of the ions in a tandem spectrum by sequence database searching, by de novo sequencing, or by a combination of the two.

However, the numerous noise peaks and isotopic peaks in high-resolution tandem spectra (such as Q-TOF spectra) make peptide identification computationally expensive. They can also cause false negative or false positive peptide identifications, since they may match the theoretical ions of an irrelevant peptide sequence. In addition, measurement errors in the ion masses further complicate identification. Data preprocessing should therefore be performed before peptide sequencing.

This thesis discusses the theory, algorithms and applications of preprocessing, and proposes methods for preprocessing tandem spectra in order to increase the accuracy of peptide identification and reduce its computational complexity.

First, the thesis proposes a key concept, the Isotope Pattern Vector (IPV), which numerically characterizes the isotope cluster of a fragment ion in a universal way. With the quantitative IPV value, noise peaks and real peaks in spectra can be distinguished, the formulae of fragment ions can be predicted, and the mass measurement errors can be analyzed.

Based on the concept of IPV, a new algorithm, PeakSelect, is proposed to find the monoisotopic peaks of ions in spectra, which are crucial in peptide sequencing. In PeakSelect, we analyze the fundamental difference between noise peaks and ion peaks, the intensity distribution of noise, and the complex overlapping of isotope peaks in spectra. Using machine learning, features are proposed to distinguish the different kinds of information in spectra, and a decision tree is constructed to classify the peaks into categories such as noise, single ion peaks and overlapping peaks. All of the potential monoisotopic masses of ions can then be calculated. Experiments show that PeakSelect greatly reduces computation time and increases the reliability of peptide identifications. In particular, PeakSelect performs well on complex spectra with a large number of peaks from large peptides, and supports more sequence identifications than other well-known systems such as ProteinLynx Global Server.

To determine the mass measurement error, we need to know the theoretical masses of the fragment ions in spectra. We therefore present a new method, FFP (Fragment ion Formula Prediction), to predict the elemental composition formulas of fragment ions and then know their theoretical...
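
To make the ideas of monoisotopic peak selection and mass measurement error more concrete, the following is a minimal Python sketch. It is not PeakSelect, IPV or FFP, whose details are not given in this abstract. It is a naive greedy deisotoping pass that assumes isotope peaks of an ion with charge z are spaced about 1.00335/z apart in m/z. The function names, the tolerance `tol`, the limit `max_charge` and the sample peak list are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: naive greedy deisotoping, NOT the thesis's PeakSelect.
NEUTRON_SPACING = 1.00335  # approx. 13C - 12C mass difference, in Da
PROTON_MASS = 1.007276     # approx. mass of a proton, in Da


def find_isotope_clusters(peaks, max_charge=3, tol=0.01):
    """Group (m/z, intensity) peaks into isotope clusters.

    Returns a list of (monoisotopic_mz, charge, member_mzs) tuples.
    `max_charge` and `tol` (in m/z units) are assumed values.
    """
    peaks = sorted(peaks)
    used = set()
    clusters = []
    for i, (mz, _) in enumerate(peaks):
        if i in used:
            continue
        best = None
        # Try higher charges first; a tie keeps the higher charge.
        for z in range(max_charge, 0, -1):
            members = [i]
            expected = mz + NEUTRON_SPACING / z
            for j in range(i + 1, len(peaks)):
                if j in used:
                    continue
                diff = peaks[j][0] - expected
                if diff > tol:
                    break  # peaks are sorted, so no later peak can match
                if abs(diff) <= tol:
                    members.append(j)
                    expected = peaks[j][0] + NEUTRON_SPACING / z
            if len(members) >= 2 and (best is None or len(members) > len(best[1])):
                best = (z, members)
        if best is not None:
            z, members = best
            used.update(members)
            clusters.append((mz, z, [peaks[k][0] for k in members]))
    return clusters


def neutral_mass(mz, charge):
    """Neutral monoisotopic mass of a protonated ion observed at `mz`."""
    return (mz - PROTON_MASS) * charge


def ppm_error(measured, theoretical):
    """Mass measurement error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


# Hypothetical peak list: one doubly charged cluster, one singly charged
# cluster, and one isolated peak that is left unassigned.
sample_peaks = [
    (500.0, 100.0), (501.003, 55.0), (502.007, 18.0),
    (400.0, 80.0), (400.5017, 40.0), (401.0034, 12.0),
    (450.123, 5.0),
]
clusters = find_isotope_clusters(sample_peaks)
```

In this sketch a peak that cannot be placed in any cluster is simply left unassigned. That is a much cruder rule than the decision-tree classification into noise, single ion peaks and overlapping peaks described above, and it does not handle overlapping isotope clusters at all.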
Keywords/Search Tags: bioinformatics, protein identification, tandem spectra, isotope pattern, preprocessing of spectra