The Research Of Intact Protein Identification Algorithm

Posted on: 2019-02-05

Degree: Master

Type: Thesis

Country: China

Candidate: Q Duan

GTID: 2370330566484152

Subject: Software engineering

Abstract/Summary:

Identifying proteins and their post-translational modifications (PTMs) is critical to the success of proteomics. Recent advances in mass spectrometry (MS) instrumentation have made it possible to generate high-resolution mass spectra of intact proteins. Existing algorithms for identifying proteins from top-down MS data achieve good performance with respect to protein-spectrum match (PrSM) precision and the prediction accuracy of PTM locations, but their running-time efficiency and the correctness of the identified target PrSMs are still far from satisfactory.

To address the low efficiency of top-down approaches, this thesis proposes CUDA-TP, an algorithm based on the Compute Unified Device Architecture (CUDA) that computes alignment scores between proteins and mass spectra. Because a graphics processing unit (GPU) can parallelize large-scale repetitive computations, CUDA-TP significantly reduces the running time compared with a serial program. First, CUDA-TP uses the optimized MS-Filter algorithm to quickly filter out proteins in the database that cannot attain a high score for a given mass spectrum, so that only a small number of candidate proteins remain. Then, an AVL tree is introduced into the algorithm to speed up the computation of protein-spectrum matching. Experimental results demonstrate that CUDA-TP significantly accelerates protein identification: it runs about 10 times and 2 times faster than MS-TopDown and MS-Align+, respectively.

To improve the accuracy of the target PrSMs identified by baseline top-down algorithms, this thesis presents a novel model called RPML, based on machine learning. It is composed of three major steps: feature extraction, model construction and score integration. In feature extraction, eleven features are extracted from the initial PrSM, the corresponding spectrum and the protein sequence. In model construction, classification models are built to predict the probabilities of all PrSMs. In score integration, RPML aggregates the prediction results from multiple classifiers to generate a consensus probability for each PrSM. The experimental results show that RPML can distinguish more correct identifications from incorrect ones.
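The score integration step can be illustrated with a minimal sketch. The abstract does not state how RPML combines the classifier outputs, so the unweighted mean used here is an assumption, and the names consensus_probability and rerank are hypothetical, chosen only for this illustration.

```python
from statistics import fmean
from typing import Sequence


def consensus_probability(per_classifier_probs: Sequence[Sequence[float]]) -> list[float]:
    """Combine per-PrSM probabilities from several classifiers.

    per_classifier_probs[c][i] is the probability, according to classifier c,
    that PrSM i is a correct identification.
    Assumption: an unweighted mean is used; the thesis may aggregate differently.
    """
    if not per_classifier_probs:
        raise ValueError("at least one classifier is required")
    n_prsms = len(per_classifier_probs[0])
    if any(len(probs) != n_prsms for probs in per_classifier_probs):
        raise ValueError("every classifier must score the same PrSMs")
    # zip(*...) regroups the scores so that each tuple holds one PrSM's
    # probabilities across all classifiers.
    return [fmean(column) for column in zip(*per_classifier_probs)]


def rerank(prsm_ids: Sequence[str],
           per_classifier_probs: Sequence[Sequence[float]]) -> list[tuple[str, float]]:
    """Return (PrSM id, consensus probability) pairs, highest probability first."""
    consensus = consensus_probability(per_classifier_probs)
    return sorted(zip(prsm_ids, consensus), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Two hypothetical classifiers scoring three hypothetical PrSMs.
    scores = [
        [0.9, 0.2, 0.6],
        [0.7, 0.4, 0.5],
    ]
    for prsm_id, prob in rerank(["prsm_a", "prsm_b", "prsm_c"], scores):
        print(f"{prsm_id}: {prob:.2f}")
```

Averaging is only one possible choice; a weighted vote or a second-level classifier trained on the individual outputs would fit the same interface.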

Keywords/Search Tags:

Proteomics, Protein Identification, Parallel Computation, Machine Learning

Related items

1	Predicting Protein-protein Interactions Based On Machine Learning Algorithms Using Logistic Regression Model To Improve Accuracy Of Peptide Identification In Mass Spectrometry Analysis
2	Machine learning algorithms for peptide identification and protein quantification in proteomics
3	Research On Prediction Model Of Plant Moonlighting Protein Based On Machine Learning
4	The Identification Of A Simple Connector Protein Pdzk1 Ligand Of The Ligand Functional Interrelated To Predict And Verify The Two Through The Integration Of Machine Learning Algorithm To Predict The System Efficient Identification Of Hpv 16 E6 Interaction
5	Research On Spatial Parallel Computation And Adaptive Parameter Tuning Based On Spark
6	Study On Identification Of Saliva-secretory Proteins Based On Machine Learning
7	Distributed Machine Learning Algorithms For Electromagnetic Targets Identification
8	Research On Predicting Protein-protein Interactions Based On Machine Learning
9	Prediction Of Protein Structure And Function With Machine Learning Methods
10	Key Techniques Research On Quantum Machine Learning