Prediction Research Of Protein Glycosylation And Phosphorylation Sites Based On Support Vector Machine

Posted on: 2017-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y Xiang
Full Text: PDF
GTID: 2310330512969708
Subject: Bioinformatics
Abstract/Summary:
Predicting protein post-translational modification sites is an important task in proteomics. Traditional experimental methods for identifying modification sites are time-consuming, whereas machine learning, which comprises five steps (data pre-processing, sequence representation, feature selection, model building and model validation), is an effective means of addressing this bioinformatics problem quickly. Sequence representation is the key element of site recognition. Based on residue position and the statistical difference table of the chi-square test, this study developed a novel feature for representing protein sequences, the chi-square score difference table method (χ²-pos), which has distinct advantages such as low dimensionality, low redundancy and a non-sparse feature matrix. Using this new position feature, the thesis carried out classification on O-glycosylation and phosphorylation data sets, respectively. The prediction results are as follows.

O-glycosylation site prediction: Glycosylation is one of the most common and important post-translational modifications of proteins. Predicting O-linked glycosylation sites with high accuracy is challenging because no consensus sequence has yet been identified for O-linked glycosylation. On the Steentoft database, the largest to date, we used χ²-pos, the pseudo position-specific scoring matrix (PsePSSM) and the undirected composition of k-spaced amino acid pairs (Undirected-CKSAAP) to represent protein sequences, and constructed five support vector machine models, each trained on balanced positive and negative samples. With weighted voting, the Matthews correlation coefficient, area under the ROC curve and prediction accuracy reached 0.79, 0.96 and 89.62%, respectively. On the same database, Steentoft et al. used transmembrane prediction, surface accessibility and protein disorder to represent protein sequences and constructed a balanced support vector machine classifier. Their Matthews correlation coefficient and prediction accuracy were 0.71 and 83%, respectively, so our result was superior to that reported in the literature.

Phosphorylation site prediction: Protein phosphorylation is a major post-translational modification, and its prediction can be divided into kinase-specific and non-kinase-specific types. Because current substrate and kinase information is incomplete, this thesis focused on non-kinase-specific prediction. On the Dou database, we used χ²-pos and PsePSSM to represent protein sequences and built a support vector machine classifier on balanced positive and negative samples. The Matthews correlation coefficient, area under the ROC curve and prediction accuracy for S/T/Y reached 0.59/0.55/0.50, 0.87/0.85/0.81 and 79.74%/77.68%/75.22%, respectively. On the same data, Dou et al. used eight features to represent protein sequences (Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent accessible area, overlapping properties, averaged cumulative hydrophobicity and k-nearest neighbor) and built a support vector machine classifier on balanced positive and negative samples. The area under the ROC curve and prediction accuracy for S/T/Y reached 0.78/0.67/0.60, respectively; our results are significantly better than those reported in the literature.

χ²-pos has broad application prospects in protein sequence representation.
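The comparisons above are stated in terms of the Matthews correlation coefficient and prediction accuracy. As a minimal sketch of how these two metrics follow from a binary confusion matrix, the example below uses the standard textbook definitions. The function names and the toy labels are illustrative assumptions and are not taken from the thesis.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = modified site, 0 = non-site)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy, made-up labels for a balanced set of 4 sites and 4 non-sites.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts))  # 0.75
print(mcc(*counts))       # 0.5
```

On a balanced data set such as the ones described here, accuracy and the Matthews correlation coefficient move together, but the coefficient stays informative when the classes are unbalanced, which is one common reason for reporting both.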
Keywords/Search Tags: protein post-translational modification, O-glycosylation, phosphorylation, sequence representation, χ²-pos
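To make the sequence representation idea concrete, here is a rough sketch of one of the features named in the abstract, the undirected composition of k-spaced amino acid pairs. The abstract does not define the feature in detail, so the sketch rests on assumptions: "undirected" is taken to mean that the pairs AB and BA are pooled, frequencies are normalized by the number of valid pairs at each gap, and the function name and the `k_max` parameter are hypothetical.

```python
from itertools import combinations_with_replacement

# The 20 standard amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# 210 unordered residue pairs: ('A', 'A'), ('A', 'C'), ..., ('Y', 'Y').
UNORDERED_PAIRS = list(combinations_with_replacement(AMINO_ACIDS, 2))


def undirected_cksaap(window, k_max=2):
    """Return unordered k-spaced pair frequencies for gaps k = 0..k_max.

    Assumption: a pair (a, b) separated by k residues is counted together
    with (b, a). The result has 210 * (k_max + 1) entries.
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(UNORDERED_PAIRS, 0)
        n_pairs = 0
        for i in range(len(window) - k - 1):
            a, b = window[i], window[i + k + 1]
            key = (a, b) if a <= b else (b, a)
            if key in counts:  # skip padding or non-standard residues
                counts[key] += 1
                n_pairs += 1
        # Normalize by the number of valid pairs found at this gap.
        features.extend(
            counts[p] / n_pairs if n_pairs else 0.0 for p in UNORDERED_PAIRS
        )
    return features


# Example: a short made-up sequence window centred on a candidate site.
vector = undirected_cksaap("MKTSAPSTSSG", k_max=2)
print(len(vector))  # 630
```

Pooling the two orderings reduces each gap from 400 ordered pairs to 210 unordered ones, which is in the spirit of the lower dimensionality the abstract emphasizes, although the exact construction used in the thesis may differ.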