
Research On Classification Of Protein Post-translational Modification Sites Based On Machine Learning In Imbalanced Data Set

Posted on: 2022-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Shen
Full Text: PDF
GTID: 2480306317468434
Subject: Big data science and application
Abstract/Summary:
Post-translational modification (PTM) is an important mechanism for regulating protein function and plays a major role in biological processes and signaling pathways. Normal PTMs regulate the physiological functions of proteins, whereas abnormal PTMs can alter protein conformation and cause dysfunction and loss of physiological activity, leading to disease. Identifying modification sites therefore helps to clarify the cellular functions and molecular mechanisms of proteins. PTM site prediction is also a typical classification problem on unbalanced data sets. Because traditional machine learning algorithms are not well suited to unbalanced data sets, effective methods for balancing the data need to be explored.

For the prediction of protein S-sulfenylation sites, this thesis attempts two methods for handling unbalanced data sets. The first works at the data level: the training data are resampled with SMOTE oversampling and One-Sided Selection undersampling, and the balanced training set is then used to build a predictor based on the random forest algorithm. The second works at the algorithm level: following the idea of ensemble learning, an ensemble random forest algorithm is used to build the predictor. The performance of the two predictors is analyzed and compared through a large number of experiments. On the S-sulfenylation data set selected in this thesis, the ensemble random forest performs better.

For the prediction of protein succinylation sites, frequency vector (FV), amino acid physicochemical properties (PCP), and one-hot encoding (OHE) are first used as feature extraction methods. Second, to reduce the feature dimension and improve feature representation, the discrete wavelet transform is applied to the PCP features and the Extra-Trees feature selection algorithm is applied to the OHE features. Finally, a broad learning system is used to construct the predictor iSuccLys-BLS, and, building on the broad learning system, a randomly labeling samples method is proposed to address the problem of unbalanced data. Experimental verification and comparison with predictors of the same type show that iSuccLys-BLS has the best classification performance on positive samples. This indicates that the randomly labeling samples method based on broad learning is practical and effective, and that it offers a new approach to the problem of unbalanced data. To facilitate use of these results, iSuccLys-BLS has been deployed on an online server at http://jci-bioinfo.cn/iSuccLys-BLS.
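The frequency vector (FV) and one-hot (OHE) encodings named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that FV means the relative frequency of each of the 20 standard amino acids within a sequence window and that OHE assigns one 20-dimensional indicator vector per window position. The residue ordering, the function names, and the example window are hypothetical.

```python
# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def frequency_vector(window):
    """Relative frequency of each standard amino acid in the window."""
    counts = [0] * len(AMINO_ACIDS)
    for residue in window:
        if residue in INDEX:  # skip padding or non-standard symbols
            counts[INDEX[residue]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def one_hot(window):
    """Concatenate one 20-dimensional indicator vector per position."""
    encoded = []
    for residue in window:
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # unknown symbols stay as an all-zero row
            row[INDEX[residue]] = 1
        encoded.extend(row)
    return encoded


# Hypothetical 7-residue window centred on a lysine (K); not thesis data.
window = "AKLKGXA"
fv = frequency_vector(window)   # 20 values
ohe = one_hot(window)           # 7 * 20 = 140 values
```

Under these assumptions the FV length is fixed at 20 regardless of window size, while the OHE length grows with the window. That growth is one reason a feature selection step such as Extra-Trees is a natural fit for the OHE features.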
Keywords/Search Tags: Post-translational modification site, unbalanced data set, broad learning system, randomly labeling samples method, S-sulfenylation, succinylation