
Investigation And Application Of Mass Spectrum Data Processing Pipeline

Posted on: 2018-12-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L W Li
GTID: 1310330518465220
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
With the completion of the Human Genome Project, scientists began a comprehensive analysis of all gene products, the proteins encoded by the genome, in order to explore the essential nature and laws of life. In the post-genomic era, proteomics has become one of the most active areas of life science research; it targets protein expression profiles and protein functions in different cells or tissues. Advances in mass spectrometry have provided an efficient analysis platform for proteomics research, with high throughput, high accuracy, high sensitivity and high resolution. Tandem mass spectrometry has become one of the most powerful techniques for protein identification, making global protein profiling possible. A mass spectrometry platform combined with sequence database searching allows rapid identification of the proteome in different cells and tissues. Shotgun proteomics is one of the most popular experimental strategies, but it suffers from a relatively low spectrum identification rate (only 5% to 30% of experimental spectra yield high-confidence peptide identifications). Thus, the traditional protein sequence database search strategy does not fully solve the key problems of protein identification. Applying the database search strategy to mass spectrometry data requires ensuring that the identifications are both complete and correct, so selecting all correct assignments is one of the most important tasks in mass spectrometry based proteomics research. The development of methods for large-scale proteomics data analysis lags far behind the development of mass spectrometry and the accumulation of massive MS/MS data. It is therefore necessary to establish a platform for automatic analysis of large-scale mass spectrometry data. In addition, high-accuracy MS/MS data can shed light on genome annotation. Searching a genome database can explain more MS/MS spectra, and the MS/MS data can in turn be used to annotate the genome, which is the concept of proteogenomics.

In this work, we focused on improving the quality control process for the database search strategy, establishing an MS/MS data analysis pipeline, and applying it to the analysis of large-scale proteome datasets. First, we developed a new method that achieves unified quality control at the PSM, peptide and protein levels, and developed the software ProDistiller for quality control and protein assignment. We also explored the influence of the protein sequence database on MS/MS data identification. We then established a pipeline (MPSS) for the analysis of large-scale, high-throughput proteome data. MPSS has been applied to data analysis in the Human Chromosome Proteome Project (C-HPP) and the Human Liver Proteome. Finally, on the basis of searching genome and predicted protein sequence databases, we established a pipeline to identify new isoforms and novel proteins, and applied it to the in-depth analysis of very large human proteome datasets.

Quality control at the protein level is more stringent than at the spectrum and peptide levels. Especially for complex sample datasets, the more experimental data are analyzed, the more false positive identifications accumulate. On the basis of the PSM quality control software PepDistiller, we developed the protein assignment and quality control software ProDistiller, which uses a pre-defined score, the F-value, to sort and assemble proteins one by one and stops at a protein FDR of 1%. ProDistiller is written in Perl and runs easily on Windows or Linux. The results of ProDistiller also contain useful information about the peptide identifications, including the charge state, missed cleavage sites and so on.

Currently available protein sequence databases, including NCBI nr, UniProt, RefSeq and Ensembl, have similar theoretical peptides but different alternatively spliced (AS) forms of the proteins. Searching UniProt or Swiss-Prot yielded more results than searching the other databases because of the higher quality of these two databases. On the other hand, UniProt and Swiss-Prot contain far fewer proteins than the Ensembl, RefSeq and NCBI nr databases, so the computational cost of searching UniProt or Swiss-Prot is lower than that of searching the other databases. As a result, the high-quality, low-redundancy UniProt/Swiss-Prot database is the best choice for proteomics research.

The pipeline for the analysis of tandem mass spectrometry data (MPSS) can perform database search, quality control, results integration and reliability assessment. It also handles multi-node scheduling, task assignment and results collection, which meets the demands of high-performance computing. MPSS has been applied to data analysis in the Human Chromosome Proteome Project (C-HPP) and the Chinese Human Proteome Project (CNHPP). More than 400 million spectra have been analyzed by MPSS.

In the 2013 C-HPP study, we used MPSS to apply a multi-omics strategy to systematically analyze, qualitatively and quantitatively, the transcriptome, translatome and proteome of the same cultured hepatoma cells (Hep3B, HCC97H and HCCLM3) with varied metastatic potential. The results provide a global view of gene expression profiles. The 9064 identified proteins covered 50.2% of all gene products in the translatome. In addition, we proposed that transcription factor (TF) enrichment might be used to improve the detection of low-abundance proteins; 31 proteins were uniquely identified using this simple enrichment. The translatome data were then used to construct a sample-specific database, which was used to discover sample-specific single amino acid polymorphisms (SAPs) in the proteome. A total of 219 unique peptides with SAPs were identified in the three cell lines, which comprised only 0.4% of all identified peptides. These peptides played a significant role in disease, although they might contribute negligibly to new protein identification.

To obtain the most complete human liver proteome data, we systematically collected and analyzed liver-related mass spectral data. The latest credible human liver proteome data identified 9901 genes, far more than in PeptideAtlas (4,408 genes). By comparison with the liver-specific expression data in Swiss-Prot and ProteinAtlas, we found that there are still a large number of missing proteins with very low F-value spectra, which might be false negatives.

Finally, we established a pipeline for in-depth analysis of MS/MS data based on the genome database. By searching a theoretical exon-exon database and AceView, we identified some highly credible candidates, including five peptides that might represent new AS forms and three new peptide segments. Although the results still require further experimental verification, they demonstrate the feasibility of genome annotation using mass spectral data.
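
The protein-level cutoff described for ProDistiller (rank proteins by score, accept them one by one, stop at 1% protein FDR) can be illustrated with a minimal sketch. This is not ProDistiller's code, which the abstract says is written in Perl; it is a Python illustration under stated assumptions. The abstract does not define the F-value or say how the FDR is estimated, so the sketch assumes a target-decoy search and estimates FDR as accepted decoys divided by accepted targets. The names `ProteinHit`, `score`, `is_decoy` and `filter_by_protein_fdr` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProteinHit:
    accession: str
    score: float    # stand-in for the F-value; its definition is not given in the abstract
    is_decoy: bool  # assumption: True if the hit matches a decoy (e.g. reversed) sequence


def filter_by_protein_fdr(hits: List[ProteinHit], max_fdr: float = 0.01) -> List[ProteinHit]:
    """Accept proteins in descending score order until the estimated FDR exceeds max_fdr.

    FDR is estimated as (decoys seen) / (targets seen), an assumed target-decoy
    estimate rather than the estimator used by ProDistiller.
    """
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    accepted: List[ProteinHit] = []
    targets = 0
    decoys = 0
    for hit in ranked:
        if hit.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Stop as soon as including this hit would push the estimate above the limit.
        if targets == 0 or decoys / targets > max_fdr:
            break
        if not hit.is_decoy:
            accepted.append(hit)
    return accepted
```

Called as `filter_by_protein_fdr(hits)`, the function returns only target proteins ranked above the point where the estimated FDR first exceeds 1%. Stopping at the first violation is the most literal reading of "stop at 1% protein FDR"; keeping the longest ranked prefix that still satisfies the limit is a common alternative design.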
Keywords/Search Tags: Proteomics, Mass Spectrometry, MS Data Processing Pipeline, Quality Control