A Study on the Prediction of Protein Post-translational Modification Sites Based on Sequence Information

Posted on: 2020-05-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X G Nan
GTID: 1360330596970158
Subject: Intelligent Environment Analysis and Planning
Abstract/Summary:
With the completion of the Human Genome Project and the arrival of the post-genome era, sequencing technology has accumulated a tremendous amount of minable data for biological research. According to the central dogma of molecular biology, genetic information is stored in DNA, but it is the protein that actually performs biological functions. The precursor protein translated from mRNA has no biological activity; it must undergo a series of processes called post-translational modification to become a mature protein with biological function. Post-translational modifications are therefore the basis on which proteins perform their normal biological functions. Many studies have shown that pupylation, ubiquitination and succinylation of protein lysine residues are closely related to the occurrence of many diseases. Elucidating the process of protein post-translational modification and its intrinsic regulatory mechanisms is a prerequisite for revealing the mechanisms of these diseases and treating them precisely. The key first step in studying post-translational modification is to find the proteins that can be modified and their modification sites. Identifying modification sites by biological experiments is time-consuming and costly, and the enzymatic reaction of post-translational modification is itself a time-consuming process, which seriously restricts research on modification site recognition. With the development of bioinformatics and computational biology, computational methods for modification site recognition have been proposed; they can identify modified sites efficiently and accurately, and they can provide useful clues for experimental research. In this thesis, methods for recognizing post-translational modification sites on lysine residues are studied on the basis of protein sequence information. The main research contents are as follows.

First, a new method for protein pupylation site recognition, called EPuL, is proposed. The innovation of this method lies in the construction of the initial reliable negative sample set, which is critical to the overall performance of positive-unlabeled learning (PU learning). In this study, a classifier-based method for constructing the initial reliable negative sample set is proposed. Once constructed, the initial set is extended by an iterative process to obtain the final reliable negative sample set. The final training set is formed by merging the final reliable negative sample set with the positive sample set, and it is used to train the final support vector machine classifier for pupylation site recognition. Cross-validation on the training set and an independent test on an independent sample set show that the proposed method outperforms existing methods in prediction performance. In addition, a number of potential pupylation sites were identified in pupylated protein sequences whose sites had not been annotated. Finally, a user-friendly web server was developed to provide a free protein pupylation site prediction service.

Second, an algorithm based on semi-supervised learning and ensemble learning is developed for the prediction of protein ubiquitination sites. Seven methods (pseudo amino acid composition, protein disorder score, physicochemical properties of amino acids, position-specific scoring matrix, composition of k-spaced amino acid pairs, binary sequence encoding, and K-nearest-neighbor score) are used to extract sequence features, and eight feature vectors are produced for every sequence. A reliable negative sample set is first built up gradually from the unlabeled sample set by an improved positive sample only learning (PSoL) algorithm applied to the eight feature vectors, and it is then used to train the prediction model. The prediction model is a random forest algorithm combined with an ensemble learning strategy: a random forest model is first trained on each single feature, and the final prediction is obtained by integrating the outputs of the eight models with logistic regression. The results of 10-fold cross-validation on the training set and of testing on an independent test set show that the proposed method can effectively identify both species-specific protein ubiquitination sites and ubiquitination sites in comprehensive cross-species data, and that it improves on the performance of existing ubiquitination site prediction algorithms. Feature analysis shows that the combined features predict better than any single feature, which supports the validity of feature combination. A comparison between a randomly constructed negative sample set and the reliable negative sample set constructed in this thesis shows that the reliable negative sample extraction strategy based on semi-supervised learning helps to improve the performance of the algorithm.

Third, a deep learning framework called SucDeep is proposed for protein succinylation site prediction. A new sequence feature extraction method based on the composition of k-spaced amino acid pairs is designed first. In this method, a 21×21 matrix records the number of times each amino acid pair appears in a sequence, and each matrix represents the amino acid pair composition for one given spacing. The matrices for multiple spacings are then combined into a set of matrices, like a multi-channel image, which serves as the feature of the sequence to be predicted. This multi-channel feature matrix is a sparse integer matrix; it resembles the representation of a computer image and is well suited to deep learning models. In parallel, the position-specific scoring matrix is used to extract sequence features and to convert each sequence into a 20-dimensional square matrix. A semi-supervised learning algorithm based on the spy technique is then developed to construct a reliable negative sample set from the unlabeled samples. The prediction algorithm used in this study is a deep learning framework consisting of two multi-layer convolutional sub-networks, each composed of three convolution layers, three pooling layers and three fully connected layers. A further fully connected layer merges the features generated by the two sub-networks and makes the final decision. A bootstrapping strategy is used during model training, which effectively avoids the influence of training set imbalance on the performance of the algorithm. Finally, a large-scale succinylation data set is constructed to test the algorithm. The results of 5-fold cross-validation on the training set and of testing on an independent test set show that the proposed algorithm improves on the prediction performance of existing succinylation site prediction algorithms.
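
The multi-channel encoding of k-spaced amino acid pairs described for SucDeep can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that the 21 symbols are the 20 standard amino acids plus a padding/unknown symbol `X`, and that one channel is built for each spacing from 0 to `max_k`; the abstract does not specify the 21st symbol or the range of spacings. The function names `kspaced_pair_matrix` and `multichannel_features` are hypothetical.

```python
# Sketch of a multi-channel k-spaced amino acid pair encoding.
# Assumption: the 21st symbol is a padding/unknown residue written as "X".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "X"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def kspaced_pair_matrix(seq, k):
    """Count residue pairs separated by exactly k positions in seq.

    Returns a 21x21 list of lists; entry [a][b] is the number of times
    residue a is followed by residue b with k residues in between.
    """
    n = len(ALPHABET)
    matrix = [[0] * n for _ in range(n)]
    for i in range(len(seq) - k - 1):
        # Residues outside the 20 standard amino acids are mapped to "X".
        a = INDEX.get(seq[i], INDEX["X"])
        b = INDEX.get(seq[i + k + 1], INDEX["X"])
        matrix[a][b] += 1
    return matrix


def multichannel_features(seq, max_k):
    """Stack one 21x21 count matrix per spacing 0..max_k, like image channels."""
    return [kspaced_pair_matrix(seq, k) for k in range(max_k + 1)]


if __name__ == "__main__":
    # Hypothetical peptide window centred on a lysine (K), padded with X.
    window = "XXMAKLSKAEX"
    channels = multichannel_features(window, max_k=2)
    print(len(channels), len(channels[0]), len(channels[0][0]))
```

Each channel is a sparse integer matrix whose entries sum to `len(seq) - k - 1`, so the stacked result has the shape of a small multi-channel image that a convolutional network could take as input.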
Keywords/Search Tags: Protein post-translational modification, Pupylation site, Ubiquitination site, Succinylation site, PU learning, Semi-supervised learning, Deep learning