A General Computer Predicted Method For Modified Sites In Protein

Posted on: 2019-12-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S P Wang
GTID: 1360330548985790
Subject: Bioinformatics and Systems Biology
Abstract/Summary:
Post-translational modifications (PTMs) of protein sequences are an important means of regulating protein function in prokaryotic and eukaryotic cells. Together with protein-protein interactions, PTMs play an important role in signal pathway transduction, apoptosis, enzyme regulation and protein subcellular localization. In recent years, with the development and use of protein sequencing, immunoprecipitation, mass spectrometry, in vitro biochemical assays, immunofluorescence and proximity ligation assays in proteomics, more and more PTMs have been reported, enriching the annotations and functional information in databases such as UniProt. Although PTMs can be detected and identified accurately with these methods, validating them remains a tough challenge for researchers at the proteome level. Based on the experimental data, researchers have proposed many machine learning models to recognize PTM sites in protein sequences. However, for some functionally important PTM sites, the existing models suffer from low prediction accuracy, or no prediction model has been constructed at all. To overcome these drawbacks, we proposed, under the guidance of my supervisor Yu-Dong Cai, a novel computational framework to recognize and predict some of the important PTM sites in protein sequences, and we tested the reliability, accuracy and efficiency of our prediction methods. The major contents of the thesis are as follows:

1. Predicting tyrosine nitration sites in protein sequences
We used amino acid features, position-specific scoring matrix features, amino acid factor features and disorder features to represent the residues in peptide segments containing nitrated tyrosine residues, and constructed a feature vector with 961 features. After ranking these features with the maximum relevance minimum redundancy method, we adopted incremental feature selection and support vector machine to build classifiers. As a result, the optimal classifier achieved a Matthews correlation coefficient (MCC) of 0.717 on the training set under 10-fold cross-validation and a sensitivity (SEN) of 0.950 on the test set. Three other machine learning algorithms were also used to build classifiers, and their prediction ability was compared with that of the support vector machine classifier. Finally, we performed a literature-based analysis of some essential features.

2. Predicting glycine myristoylation sites in protein sequences
Using the same four types of sequence-based features as for nitration sites, we constructed prediction models for N-terminal myristoylation sites from the UniProt database. In this model, we applied a widely used type of artificial neural network (ANN), the extreme learning machine, and constructed a three-layer ANN to identify peptide segments representing myristoylation sites. On the training and test sets, the optimal ANN classifier obtained an MCC of 0.983 and a SEN of 0.787, respectively. We also performed a biological analysis of the 41 optimal features.

3. Predicting lysine malonylation sites in protein sequences
Malonylation is a PTM found on the side chain of lysine residues in recent years; together with succinylation, propionylation and butyrylation sites, malonylation sites have drawn much attention from researchers. In our study, classifiers derived from feature selection and random forest were used to recognize malonylation sites in protein sequences. According to the incremental feature selection curve (IFS curve), the optimal random forest classifier achieved an F-measure of 0.356 when using the first 593 features. Based on an analysis of some essential features, we found some site and residue preferences around malonylation sites.

4. Predicting thioether bonds in lantibiotics
Lantibiotics are antibiotics derived from bacteria. Their lanthionine (Lan) and β-methyllanthionine (MeLan) residues form ring structures in the peptides, which is a key structural factor for the biological functions of lantibiotics. In our study, we first proposed a machine learning model to recognize Lan and MeLan residues. Using four algorithms (nearest neighbor algorithm, Dagging algorithm, support vector machine and random forest) to search for the optimal classifiers, we found that the optimal classifiers based on random forest achieved MCCs of 0.813 and 0.769 on the two types of residues, respectively, which showed that the optimal classifiers had good ability to recognize them.

5. Recognizing acetylation, sumoylation and ubiquitination sites on lysine residues
Because of the physicochemical properties of the lysine side chain, many PTMs can occur on it. Among them, acetylation, sumoylation and ubiquitination, the three most abundant PTM types, play key roles in the signal pathways and metabolic reactions of the cell. Therefore, we collected peptide segments containing the three types of PTM sites and constructed machine learning classifiers based on six types of features to recognize all three simultaneously. Under 10-fold cross-validation, the optimal classifier obtained a total accuracy of 0.989. Finally, based on the optimal features, we further explored the site and residue preferences around the three PTM sites.

6. Recognizing cleavage sites in signal peptides
At the N-terminus of some preprotein sequences there are sequence segments called signal peptides, which are the dominant factor in determining subcellular localization. When preproteins are transferred to their destinations, the signal peptides are enzymatically cleaved off at the cleavage sites, releasing the mature proteins. After removing protein sequences with transmembrane regions, we obtained 2,863 protein sequences from 722 species with experimentally validated signal peptides from the UniProt database to build prediction models that recognize the cleavage sites and signal peptides. We balanced our dataset by adding synthesized positive samples based on the synthetic minority over-sampling technique (SMOTE). Classifiers could then be constructed on a balanced dataset, which improved their ability to recognize positive samples. The optimal classifiers derived from the Dagging algorithm and random forest achieved Youden's indices of 0.871 and 0.736, respectively, which indicated that the optimal classifiers had outstanding ability to identify cleavage sites and signal peptides in prokaryotic and eukaryotic cells.
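
The abstract reports its results with several evaluation measures: sensitivity (SEN), Matthews correlation coefficient (MCC), F-measure and Youden's index. As a minimal sketch, the standard textbook definitions of these measures can be computed from the four confusion-matrix counts as shown below. The function name `binary_metrics` and the example counts are assumptions made for illustration; they are not taken from the thesis and do not reproduce any of its reported values.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives. Undefined ratios fall back to 0.0.
    """
    sen = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    spe = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    pre = tp / (tp + fp) if (tp + fp) else 0.0   # precision

    # F-measure: harmonic mean of precision and sensitivity
    f_measure = 2 * pre * sen / (pre + sen) if (pre + sen) else 0.0

    # Matthews correlation coefficient
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Youden's index: sensitivity + specificity - 1
    youden = sen + spe - 1.0

    return {"SEN": sen, "SPE": spe, "F": f_measure, "MCC": mcc, "Youden": youden}


# Hypothetical counts, chosen only to exercise the function.
m = binary_metrics(tp=90, fp=20, tn=180, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

MCC and Youden's index both use all four counts, which is one reason they are commonly preferred over plain accuracy on imbalanced datasets such as PTM-site data, where negative samples usually far outnumber positive ones.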
Keywords/Search Tags: post-translational modifications, maximum relevance and minimum redundancy, incremental feature selection, machine learning algorithm