
Research On Data Mining Based Prediction Of Protein Sumoylation Sites

Posted on: 2016-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: J L Zhang
Full Text: PDF
GTID: 2308330461466594
Subject: Computer application technology
Abstract/Summary:
SUMOylation is an important type of post-translational modification (PTM) that plays an essential role in many cellular processes, ranging from protein folding and maturation to signal transduction. However, SUMOylation sites are commonly identified by experimental approaches, which are laborious and expensive. As an alternative, bioinformatics approaches based on data mining and machine learning have become a current research focus. These methods can be applied in a high-throughput manner to predict and prioritize potential SUMOylation substrates and sites. This study focuses on predicting protein SUMOylation sites with data mining techniques. The research outcomes are as follows:

(1) A feature selection and ensemble learning method is proposed to predict protein SUMOylation sites. The major steps are as follows. First, we preprocessed the datasets of three species and extracted features with bioinformatics methods to transform the samples into vectors. Second, we applied a two-step feature selection method to filter out redundant and irrelevant features and select a condensed feature subset. Finally, we built the prediction model with the random forests algorithm. The experimental results show that our method can extract an effective consensus motif. The proposed mRMR+FFS algorithm reaches AUC values of about 0.861, 0.966, and 0.970 on H. sapiens, M. musculus, and S. cerevisiae, respectively, and the proposed mRMR+IFS algorithm reaches AUC values of about 0.824, 0.951, and 0.921. Compared with the seeSUMO and GPS-SUMO algorithms, our method improves AUC by about 10% on the three species.

(2) Owing to the limitations of current prediction methods, it is difficult to select non-SUMOylated negative samples.
We propose a novel positive and unlabeled learning (PU learning) algorithm, the averaged n-dependence decision tree (P-AnDT) algorithm, for SUMOylation prediction. The major steps are as follows. First, we arranged positive and unlabeled SUMOylation datasets. Next, we selected the top 500 significant features with the feature selection method described above. Finally, we built the PU learning prediction models with the P-AnDT algorithm. The experimental results on the rearranged PU datasets show that the proposed P-AnDT algorithm reaches F1 values of about 0.784 and 0.693; 0.741 and 0.711; and 0.702 and 0.680 on the benchmark and independent test datasets, respectively, of the three species. Compared with other Bayes-based PU learning algorithms, such as PNB and PTAN, P-AnDT improves F1 by about 8% on the three species.
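
The two-step feature selection described in (1) can be illustrated with a small sketch. It is not the thesis implementation: it assumes discrete features, ranks them with a greedy mRMR criterion (mutual information with the label minus mean mutual information with the features already chosen), and then applies incremental feature selection (IFS) by scoring each prefix of the ranking. The function names, the toy data, and the lookup-table scorer `table_accuracy` are all hypothetical; the scorer only stands in for the cross-validated random forest model that the abstract refers to.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_rank(columns, labels, k):
    """Greedy mRMR: repeatedly pick the feature with the highest
    relevance(feature, label) minus mean redundancy(feature, selected)."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected, remaining = [], list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def table_accuracy(columns, labels, subset):
    """Stand-in scorer (assumption): training accuracy of a majority-vote
    lookup table over the chosen features. A real pipeline would use a
    cross-validated classifier such as a random forest instead."""
    groups = {}
    for i, y in enumerate(labels):
        key = tuple(columns[j][i] for j in subset)
        groups.setdefault(key, Counter())[y] += 1
    return sum(max(c.values()) for c in groups.values()) / len(labels)


def incremental_feature_selection(ranked, evaluate):
    """IFS: score every prefix of the ranked list and keep the best one
    (the shortest prefix wins when scores tie)."""
    best_prefix, best_score = [], float("-inf")
    for i in range(1, len(ranked) + 1):
        current = evaluate(ranked[:i])
        if current > best_score:
            best_prefix, best_score = ranked[:i], current
    return best_prefix, best_score


# Hypothetical toy data: one list per feature, one entry per sample.
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # feature 0
    [0, 0, 0, 0, 1, 1, 1, 1],  # feature 1: exact copy of feature 0 (redundant)
    [0, 0, 1, 1, 0, 0, 1, 1],  # feature 2: independent of feature 0
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]  # label = feature 0 AND feature 2

ranked = mrmr_rank(columns, labels, k=3)
subset, subset_score = incremental_feature_selection(
    ranked, lambda s: table_accuracy(columns, labels, s)
)
print("mRMR ranking:", ranked)
print("IFS subset:", subset, "score:", subset_score)
```

On this toy input the duplicated feature is penalised for redundancy and ranked last, and IFS stops growing the subset once adding features no longer improves the score.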
Keywords/Search Tags: SUMOylation site prediction, post-translational modifications, feature selection, PU learning
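
The PU learning setting in (2) can be made concrete with a minimal sketch. It does not implement P-AnDT, whose details are not given in the abstract. Instead it follows the general idea behind positive naive Bayes (PNB), one of the baselines named above: the negative-class feature distribution is never observed, so it is recovered from the unlabeled data through the mixture identity P(v | unlabeled) = p · P(v | positive) + (1 − p) · P(v | negative). The positive class prior `prior_pos` (p) is assumed to be known here, and the function names, the binary toy features, and the hidden labels used for scoring are all hypothetical.

```python
from collections import Counter
from math import log


def fit_pnb(positives, unlabeled, prior_pos, values=(0, 1), eps=1e-6):
    """Estimate P(x_j = v | positive) and P(x_j = v | negative) per feature.

    The negative term is derived from the unlabeled set via
    P(v | unl) = prior_pos * P(v | pos) + (1 - prior_pos) * P(v | neg).
    """
    model = []
    for j in range(len(positives[0])):
        pos_counts = Counter(row[j] for row in positives)
        unl_counts = Counter(row[j] for row in unlabeled)
        p_pos, p_neg = {}, {}
        for v in values:
            # Laplace-smoothed frequencies.
            p_pos[v] = (pos_counts[v] + 1) / (len(positives) + len(values))
            p_unl = (unl_counts[v] + 1) / (len(unlabeled) + len(values))
            raw = (p_unl - prior_pos * p_pos[v]) / (1 - prior_pos)
            p_neg[v] = max(raw, eps)  # the raw estimate can fall below zero
        total = sum(p_neg.values())
        p_neg = {v: p / total for v, p in p_neg.items()}  # renormalise
        model.append((p_pos, p_neg))
    return model


def predict_pnb(model, prior_pos, row):
    """Naive Bayes decision rule using the estimated class-conditionals."""
    log_pos = log(prior_pos) + sum(log(p_pos[x]) for (p_pos, _), x in zip(model, row))
    log_neg = log(1 - prior_pos) + sum(log(p_neg[x]) for (_, p_neg), x in zip(model, row))
    return 1 if log_pos > log_neg else 0


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when nothing is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: two binary features per candidate site.
positives = [(1, 1)] * 4 + [(1, 0)]                  # labeled positive sites
unlabeled = [(1, 1)] * 2 + [(0, 0)] * 6 + [(0, 1)] * 2
hidden_truth = [1] * 2 + [0] * 8                     # used only for scoring
prior_pos = 0.2                                      # assumed to be known

model = fit_pnb(positives, unlabeled, prior_pos)
predictions = [predict_pnb(model, prior_pos, row) for row in unlabeled]
print("predictions:", predictions)
print("F1:", f1_score(hidden_truth, predictions))
```

In practice the class prior is not known and has to be estimated or tuned, and the hidden labels exist only in a toy setting like this one, so the F1 printed here says nothing about the results reported in the abstract.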