
The Study For The Prediction Of RNA Methylation Modification Sites Based On Deep Learning Algorithm

Posted on: 2023-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Zhang
GTID: 1520306908968419
Subject: Information management and information systems
Abstract/Summary:
RNA modification, the subject of epitranscriptomics, is a basic cellular process that organisms need in order to function. To date, more than 170 kinds of post-transcriptional RNA modifications have been identified, about two thirds of which are methylations. RNA methylation refers to methylation modifications occurring at different positions of RNA nucleotides, among which 5-methylcytosine (m5C), N7-methylguanosine (m7G) and N6-methyladenosine (m6A) are the most representative types. These modifications play a crucial role in the structure, function and metabolism of RNA, and accumulating evidence shows that m5C, m7G and m6A are associated with the pathogenesis of many diseases. Accurately determining the distribution of m5C, m7G and m6A across the transcriptome is therefore the basis for an in-depth understanding of their biological functions and modification mechanisms, and it can also aid in developing targeted drugs and in determining the pathogenesis of the related diseases.

Studies have shown that high-throughput sequencing methods can identify modification sites accurately, but they are expensive and time-consuming for transcriptome-wide detection. It is therefore imperative to design computational methods that identify modification sites accurately, effectively and efficiently. Researchers have already proposed some machine learning-based tools for identifying methylation modifications, but there is still room for improvement in predictive performance. Targeting the three types of RNA methylation modification (5-methylcytosine, N7-methylguanosine and N6-methyladenosine), we propose and construct three novel machine learning-based prediction methods that address four aspects: feature extraction from sequence data, selection of important features, integration of machine learning algorithms, and training strategies. We also design a prediction platform based on the Flask framework to provide accurate prediction of RNA methylation modification sites. The specific research contents are as follows:

(1) A new method, IFS-LightGBM, is proposed for predicting RNA modification sites. It combines the incremental feature selection method with the Light Gradient Boosting Machine (LightGBM) feature selection method to build a feature selection scheme, and uses Random Forest as the classifier. Firstly, four feature extraction methods, namely binary encoding (BE), position-specific nucleotide propensity (PSNP), pseudo dinucleotide composition (PseDNC) and nucleotide chemical property (NCP), are used to convert RNA sequences into feature vectors. Secondly, a new feature selection scheme based on the LightGBM feature selection method and the incremental feature selection method is designed to remove redundant and noisy information from the fused feature set. Finally, the Random Forest algorithm, which obtains the best prediction performance when combined with this feature selection scheme, is selected to construct the prediction model. The resulting procedure, IFS-LightGBM, achieves an accuracy of 91.67% and a Matthews correlation coefficient (MCC) of 0.8352 on the dataset, which are 5.01%-25.35% and 0.1032-0.4852 higher, respectively, than those of the state-of-the-art methods. These experimental results indicate the effectiveness of IFS-LightGBM.

(2) A prediction method named BERT-m7G, based on bidirectional encoder representations from transformers (BERT) and a stacking ensemble classifier, is constructed to identify RNA modification sites. To better capture hidden information that helps predict the modification sites, BERT-m7G takes the original RNA sequences as the model input. This is the first time that BERT has been used to convert RNA sequences into feature descriptors. Firstly, the RNA sequences are treated as natural-language sentences and the BERT model transforms them into numerical matrices of a fixed length. Secondly, a feature selection scheme based on the elastic net method is built to eliminate redundant features and retain important ones, which reduces the search time without affecting the prediction performance. Finally, the tree-structured Parzen estimator (TPE) method is used to tune the hyper-parameters of the stacking ensemble classifier and establish the best model. The results indicate that the accuracy (ACC), sensitivity (SN), specificity (SP) and MCC of the proposed BERT-m7G are 95.5%, 95.8%, 95.1% and 0.910, respectively. Compared with state-of-the-art methods, the ACC is higher by 3%-20.7% and the MCC by 0.06-0.415, which demonstrates that BERT-m7G outperforms the other state-of-the-art prediction methods.

(3) A novel cross-species computational method, DNN-m6A, based on a deep neural network (DNN), is proposed to identify RNA modification sites in multiple tissues of mammals. Firstly, eight feature extraction methods, namely binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP), are used to extract sequence features from every RNA sequence. Parameter selection is then carried out for the two feature sets that depend on parameters, PseDNC and KSNPFs. After the optimal parameters of these two methods are determined, the eight individual feature sets are fused to obtain the initial feature vector set. Secondly, feature selection methods with their optimal parameters are used to establish the most appropriate feature selection scheme for feature reduction. Finally, multiple hyper-parameters of the deep neural network are tuned with Bayesian hyper-parameter optimization on the selected feature subset. The cross-validation test on the training datasets shows that the prediction accuracy and the area under the curve (AUC) of DNN-m6A are 73.58%-83.38% and 80.79%-91.09%, respectively. Furthermore, on the independent test datasets DNN-m6A obtains an ACC of 72.95%-83.04% and an AUC of 80.79%-91.09%. The comprehensive comparison on the training datasets and the independent datasets indicates that DNN-m6A outperforms the state-of-the-art methods in both prediction performance and generalization ability.

(4) For the user's convenience, an RNA methylation modification site prediction platform based on the Flask framework is constructed, which integrates the three prediction methods proposed in this work. The user only needs to upload the RNA sequence to be tested and choose the type of modification, and the corresponding prediction result is then returned online.
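
The first step shared by the methods above, converting an RNA sequence into a feature vector, can be illustrated with the two simplest encodings the abstract names: binary encoding (BE) and nucleotide chemical property (NCP). The following is a minimal sketch rather than the dissertation's implementation. The function names `encode` and `fused_features` are hypothetical, the A, C, G, U column order of the one-hot table is an assumed convention, and the NCP triples follow the commonly used (ring structure, functional group, hydrogen bonding) coding.

```python
# Binary encoding (BE): one-hot vector per nucleotide.
# The A, C, G, U column order is an assumption made for this sketch.
BINARY = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "U": (0, 0, 0, 1),
}

# Nucleotide chemical property (NCP), as commonly defined:
# (ring structure: purine=1, functional group: amino=1, hydrogen bond: weak=1).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def encode(sequence, table):
    """Concatenate the per-nucleotide codes from `table` into one flat list."""
    features = []
    for nucleotide in sequence.upper():
        if nucleotide not in table:
            raise ValueError(f"unsupported nucleotide: {nucleotide!r}")
        features.extend(table[nucleotide])
    return features


def fused_features(sequence):
    """Fuse the BE and NCP descriptors by simple concatenation."""
    return encode(sequence, BINARY) + encode(sequence, NCP)


if __name__ == "__main__":
    window = "GGACU"  # hypothetical sequence window around a candidate site
    vector = fused_features(window)
    print(len(vector))  # 5 nucleotides * (4 + 3) values each
```

A fused vector of this kind is what a feature selection scheme would then prune before classification; the remaining descriptors named in the abstract (PSNP, PseDNC, KSNPFs and others) depend on training-set statistics or tunable parameters and are not sketched here.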
Keywords/Search Tags: post-transcriptional modification, machine learning, feature selection, deep neural network, stacked ensemble classifier, BERT, Bayesian hyper-parameter optimization