
A Method For N6-methyladenosine Sites Identification Based On Sequence Characteristics And Graph Embedding Information

Posted on: 2022-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: R Guo
Full Text: PDF
GTID: 2480306758491774
Subject: Automation Technology
Abstract/Summary:
N6-methyladenosine modification, also known as m6A, is one of the most highly conserved post-transcriptional modifications and is widespread in eukaryotic mRNA and long non-coding RNA. Studies have shown that m6A modification is involved in many biological processes, including protein translation and localization, mRNA splicing and stability, and RNA localization and degradation. m6A is also associated with many human diseases, such as prostate cancer and thyroid tumors. Accurately identifying m6A modification sites on RNA sequences is therefore of great biological significance.

Traditional wet-lab methods for detecting m6A modification sites in mRNA have several limitations: they are costly in time and money, their experimental procedures are complex, and they are difficult to apply to large-scale site identification. In recent years, researchers have proposed a variety of machine-learning-based predictors of m6A modification sites. When constructing such a predictor, the choice of sequence-derived feature encoding methods is the key factor affecting its performance. However, most existing predictors extract features directly from a single RNA sequence; only a few extract statistical information from the positive and negative datasets separately, and very few mine information from the whole dataset and from the relationships between samples. The classification information contained in sample similarity is therefore not fully exploited, and prediction accuracy can be improved further.

To address these problems, this work proposes a computational method for identifying m6A modification sites that uses both traditional sequence-derived features and graph embedding information. After a study of the sequence feature encoding methods commonly used by existing m6A modification site predictors, the following encodings were adopted: nucleotide composition-transformation-distribution (CTD), k-spaced nucleotide pair frequencies (KSNPF), nucleotide chemical property and nucleotide density (NCP-ND), nucleotide pair position specificity (NPPS), bi-profile Bayes (BPB), electron-ion interaction pseudopotential (EIIP), and pseudo k-tuple nucleotide composition (PseKNC). A sample-sample similarity network is then constructed from these sequence features using the fast linear neighbor similarity approach (FLNSA). The graph embedding features of each sample in the network are learned with three graph embedding algorithms: SocDim, node2vec, and GraRep. Finally, the sequence features and graph embedding features are concatenated into input vectors, and a predictor named m6AGE is trained with the CatBoost classifier.

This predictor is the first to combine sequence-derived features and graph embeddings for m6A site prediction. The graph embedding features of each sample are learned in an unsupervised manner by the three graph embedding algorithms on the sample similarity network, and they capture latent relationships between samples drawn from the whole dataset. They can therefore serve as an important supplement to sequence-derived features and further improve the performance of the predictor.

Four datasets covering three species (Arabidopsis, Saccharomyces cerevisiae, and human) were collected. On these four datasets, comparative experiments were carried out on feature combinations, classifier selection, comparison with existing predictors, and performance on an imbalanced dataset; the results further verify the effectiveness of the proposed method for identifying m6A modification sites. To make the predictor freely and conveniently available to researchers, an online prediction system based on the proposed method was built. The website is http://www.m6age.cloud.
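As a rough illustration of the first two stages described above (sequence encoding, then a sample-sample similarity network), the following Python sketch encodes a few RNA windows with three of the named encodings (NCP-ND, EIIP, KSNPF) and links each sample to its most similar neighbors. It is a minimal sketch under stated assumptions, not the m6AGE implementation: the toy sequences, the function names, the `k_max` and `k` parameters, and the use of a cosine k-nearest-neighbor graph in place of FLNSA are all assumptions made for illustration. The graph embedding step (SocDim, node2vec, GraRep) and the CatBoost classifier are omitted.

```python
from itertools import product
from math import sqrt

# Chemical properties per nucleotide: (purine ring, amino group, weak H-bond).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}
# Electron-ion interaction pseudopotential values per nucleotide.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}
PAIRS = ["".join(p) for p in product("ACGU", repeat=2)]


def ncp_nd(seq):
    """NCP triple plus running density of the nucleotide at each position."""
    feats, seen = [], {}
    for i, nt in enumerate(seq, start=1):
        seen[nt] = seen.get(nt, 0) + 1
        feats.extend(NCP[nt])
        feats.append(seen[nt] / i)  # occurrences so far / prefix length
    return feats


def eiip(seq):
    """One EIIP value per position."""
    return [EIIP[nt] for nt in seq]


def ksnpf(seq, k_max=2):
    """Frequencies of nucleotide pairs separated by k = 0..k_max positions."""
    feats = []
    for k in range(k_max + 1):
        total = len(seq) - k - 1
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(total):
            counts[seq[i] + seq[i + k + 1]] += 1
        feats.extend(counts[p] / total for p in PAIRS)
    return feats


def encode(seq):
    """Concatenate the three encodings into one feature vector."""
    seq = seq.upper().replace("T", "U")
    return ncp_nd(seq) + eiip(seq) + ksnpf(seq)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(vectors, k=2):
    """Assumed stand-in for FLNSA: link each sample to its k most similar ones."""
    edges = {}
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        edges[i] = [j for _, j in sorted(sims, reverse=True)[:k]]
    return edges


# Hypothetical equal-length windows; real samples would be centered on a candidate A.
seqs = ["GGACUAGGACU", "GGACUAGGACA", "UUUUCAAAAGC", "CCGAUAUCGGA"]
X = [encode(s) for s in seqs]
graph = knn_graph(X, k=2)
```

In a fuller pipeline, an embedding vector learned for each node of `graph` would be concatenated with the matching row of `X` before training the classifier, which is the combination the abstract describes.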
Keywords/Search Tags: bioinformatics, m6A, N6-methyladenosine, graph embedding, CatBoost