The Research Of Intact Protein Identification Algorithm

Posted on: 2019-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q Duan
Full Text: PDF
GTID: 2370330566484152
Subject: Software engineering
Abstract/Summary:
Identifying proteins and their post-translational modifications (PTMs) is critical to the success of proteomics. Recent advances in mass spectrometry (MS) instrumentation have made it possible to generate high-resolution mass spectra of intact proteins. Existing algorithms for identifying proteins from top-down MS data achieve good performance with respect to protein-spectrum match (PrSM) precision and the prediction accuracy of PTM locations, but their running time and the correctness of the identified target PrSMs are still far from satisfactory.

To address the low efficiency of top-down approaches, this thesis proposes an algorithm called CUDA-TP, based on the compute unified device architecture (CUDA), to compute alignment scores between proteins and mass spectra. Because a graphics processing unit (GPU) can parallelize large-scale repetitive computations, CUDA-TP significantly reduces the running time compared with a serial program. First, CUDA-TP uses an optimized MS-Filter algorithm to quickly filter out proteins in the database that cannot attain a high score for a given mass spectrum, so that only a small number of candidate proteins remain. Then, an AVL tree is introduced into the algorithm to speed up the computation of protein-spectrum matching. Experimental results demonstrate that CUDA-TP significantly accelerates protein identification: it runs about 10 times faster than MS-TopDown and about 2 times faster than MS-Align+.

To improve the accuracy of the target PrSMs identified by baseline top-down algorithms, this thesis presents a novel model called RPML, based on machine learning. It is composed of three major steps: feature extraction, model construction and score integration. In feature extraction, eleven features are extracted from the initial PrSM, the corresponding spectrum and the protein sequence. In model construction, classification models are built to predict the probabilities of all PrSMs. In score integration, RPML aggregates the prediction results from multiple classifiers to generate a consensus probability for each PrSM. The experimental results show that RPML can distinguish more correct identifications from incorrect ones.
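The score-integration step can be illustrated with a minimal sketch. The abstract does not state how RPML combines the classifier outputs, so the unweighted mean used here is an assumption, and the function names, PrSM identifiers and probability values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def consensus_probabilities(per_classifier_probs):
    """Average the probabilities that several classifiers assign to each PrSM.

    per_classifier_probs holds one list per classifier; position i in every
    list refers to the same PrSM. The unweighted mean is an assumed rule,
    not the aggregation confirmed by the thesis.
    """
    if not per_classifier_probs:
        raise ValueError("need predictions from at least one classifier")
    n_prsms = len(per_classifier_probs[0])
    if any(len(probs) != n_prsms for probs in per_classifier_probs):
        raise ValueError("every classifier must score the same PrSMs")
    # zip(*...) regroups the values so that each tuple belongs to one PrSM.
    return [mean(scores) for scores in zip(*per_classifier_probs)]


def rerank(prsm_ids, per_classifier_probs):
    """Return (PrSM id, consensus probability) pairs, most confident first."""
    consensus = consensus_probabilities(per_classifier_probs)
    return sorted(zip(prsm_ids, consensus), key=lambda pair: pair[1], reverse=True)


# Hypothetical outputs of three classifiers for three PrSMs.
probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.5],
    [0.8, 0.3, 0.7],
]
ranked = rerank(["prsm_a", "prsm_b", "prsm_c"], probs)
for prsm_id, probability in ranked:
    print(f"{prsm_id}: {probability:.2f}")
```

Averaging is only one possible choice; a weighted vote or a second-level model trained on the classifier outputs would slot into `consensus_probabilities` without changing the re-ranking step.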
Keywords/Search Tags:Proteomics, Protein Identification, Parallel Computation, Machine Learning