Research On Methods Of Lysine Post-translational Modification Sites Prediction Based On Support Vector Machine

Posted on: 2017-01-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Ju
GTID: 1310330512461462
Subject: Control theory and control engineering
Abstract/Summary:
Post-translational modifications (PTMs) are chemical modifications of a protein after translation, and they play an important role in regulating the conformational changes, activities, and functions of proteins. To better study the modification dynamics and molecular mechanisms of PTMs, the fundamental and crucial first step is the accurate identification of PTM sites. In the last decade, research on machine-learning-based PTM site prediction has developed rapidly and has become an important and active field in bioinformatics. Building on current progress in PTM site prediction, this dissertation uses the support vector machine (SVM) and improved variants of it to predict protein PTM sites from amino acid sequences. The main research contents are as follows:
1. The prediction accuracy of existing methylation site predictors is still unsatisfactory, and those predictors only predict whether a query lysine residue is a methylation site without considering its methylation degree. A novel two-level predictor named iLM-2L is therefore proposed to predict lysine methylation sites and their methylation degrees. First, features based on the composition of k-spaced amino acid pairs are used to train a prediction model for lysine methylation sites and to improve the accuracy of methylation site prediction. Then, the prediction of methylation degrees is modeled as a multi-label classification problem and trained with a multi-label SVM algorithm. Computational results indicate that iLM-2L outperforms five existing predictors: MeMo, MASA, BPB-PPMS, PMeS, and iMethyl-PseAAC. Moreover, iLM-2L can also effectively predict the methylation degrees of methyllysine sites. Analysis of the optimal k-spaced amino acid pairs reveals potential sequence patterns around methyllysine sites.
Based on the iLM-2L model, a corresponding online web server is established (http://123.206.31.171/iLM2L/).
2. A novel model called IMP-PUP is constructed to predict pupylation sites in prokaryotic proteins. Because pupylation site data are scarce, a modified semi-supervised self-training SVM algorithm is proposed as the core learning algorithm of IMP-PUP. The proposed algorithm can exploit the information in the non-annotated pupylated proteins in PupDB and thereby improve the prediction of pupylation sites. A minimum distance rule is introduced to design the confidence function: the algorithm selects the predicted unlabeled samples nearest to the labeled set instead of the samples with the highest SVM scores. This overcomes the disadvantage of the original self-training SVM algorithm, in which misclassification may occur during the initial stage of iterative training. Computational results indicate that IMP-PUP significantly outperforms three existing pupylation site predictors: GPS-PUP, iPUP, and pbPUP. A user-friendly web server for IMP-PUP is available at http://123.206.31.171/IMPPUP/.
3. A novel predictor called CKSAAPPhoglySite is developed to predict lysine phosphoglycerylation sites. To address data imbalance and noise in the prediction of protein phosphoglycerylation sites, a novel fuzzy SVM algorithm is proposed. The fuzzy membership in the proposed algorithm is defined not only by the distance between a training sample and its class center, but also by the closeness around the training sample.
Moreover, several assessments show that the composition of k-spaced amino acid pairs is more suitable for representing the protein sequence around phosphoglycerylation sites than other encoding schemes, including amino acid composition, binary encoding, position-specific scoring matrix, and secondary structure. The CKSAAPPhoglySite model is built on the proposed fuzzy SVM and the composition of k-spaced amino acid pairs. Jackknife test results show that the predictive accuracy of CKSAAPPhoglySite is 14.2% higher than that of Phogly-PseAAC. A free online service for CKSAAPPhoglySite is accessible at http://123.206.31.171/CKSAAPPhoglySite/.
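The composition of k-spaced amino acid pairs used by iLM-2L and CKSAAPPhoglySite can be illustrated with a minimal Python sketch. The abstract does not state the window length, the residue alphabet, or the largest spacing used, so the 20-letter alphabet, the `k_max` parameter, and the function name `cksaap` below are assumptions chosen for illustration; this is one common formulation of the encoding, not necessarily the exact one in the dissertation.

```python
from itertools import product
from typing import List

# Assumption: the 20 standard amino acids; any other symbol in the
# window (for example a gap character) is ignored when counting pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(window: str, k_max: int = 4) -> List[float]:
    """Encode a sequence window as k-spaced amino acid pair frequencies.

    For each spacing k in 0..k_max, count every ordered pair of residues
    separated by exactly k other residues, then divide by the number of
    positions where such a pair can start (len(window) - k - 1).
    The result has 400 * (k_max + 1) entries.
    """
    features: List[float] = []
    for k in range(k_max + 1):
        n_positions = len(window) - k - 1
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(max(n_positions, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        denom = n_positions if n_positions > 0 else 1  # avoid division by zero
        features.extend(counts[p] / denom for p in PAIRS)
    return features
```

A feature vector of this kind, computed for the sequence window centered on each candidate lysine, could then be passed to any SVM implementation as input.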
Keywords/Search Tags: Bioinformatics, Protein Post-Translational Modification, Methylation, Support Vector Machine, Multi-Label Classification