
Theoretical Study Of RNA Coding Potential

Posted on: 2021-01-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X X Tong
Full Text: PDF
GTID: 1480306107455914
Subject: Theoretical Physics

Abstract/Summary:
Approximately 2% of the sequences in the human genome are transcribed into mRNA, and a large number of sequences that were previously thought to be "junk" are also transcribed into non-coding RNAs (ncRNAs). To search for genes among thousands of sequences, many mature algorithms have been developed in the field of bioinformatics, and they have achieved encouraging results in gene prediction. However, as research has continued, researchers have discovered that some RNAs regarded as non-coding are in fact mRNAs that contain small open reading frames (small ORFs, sORFs). Most previous bioinformatics algorithms for finding mRNA were limited to sequences of more than 300 nucleotides. This defect leads gene annotation software to misclassify some "long ncRNAs (lncRNAs)" that contain sORFs.

We therefore propose CPPred, a coding potential prediction method based on a support vector machine (SVM) classifier with multiple features, including novel RNA features encoded by the global description. Owing to these novel features, CPPred distinguishes better than the state-of-the-art methods not only between coding RNAs and ncRNAs, but also between small coding RNAs and small ncRNAs. We also show that the features encoded by the global description (T2, C0 and GC), which can capture structural features of RNA folding, play a key role in the prediction of coding potential. CPPred achieves high prediction accuracy on human, mouse, zebrafish and Saccharomyces cerevisiae testing sets, and its accuracy on these sets is higher than that of previously published tools. Its advantage is most pronounced on the small RNA (sORF) testing sets of these species, where it improves substantially on earlier tools.

Building on CPPred, we develop CPPred-sORF by adding two features and allowing non-AUG start codons, which enables a comprehensive evaluation of sORFs. CPPred-sORF is built on a dataset that uses small coding RNAs as positive data and lncRNAs as negative data. lncRNAs are harder to distinguish from small coding RNAs than small ncRNAs are, because the longer a sequence is, the more likely it is to contain open reading frames. On the independent testing set, the sensitivity, specificity and Matthews correlation coefficient (MCC) of CPPred-sORF reach 88.22%, 88.84% and 0.768, respectively, a much better prediction performance than that of the other methods.

Furthermore, we develop CircPred to distinguish circular RNA (circRNA) from coding RNA and circRNA from lncRNA. In addition to the CPPred features, CircPred uses AT-GT signal characteristics related to splicing, because splicing is closely related to circRNA. We construct a non-redundant dataset of circRNAs, coding RNAs and lncRNAs. For circRNA versus lncRNA, the sensitivity, specificity and MCC obtained by CircPred are 85.17%, 96.56% and 0.833, respectively. For circRNA versus coding RNA, they are 65.58%, 98.75% and 0.739, respectively, which is better than the other tools for predicting coding potential.
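The sensitivity, specificity and MCC values quoted in this abstract are standard binary-classification measures. As a minimal sketch of how they follow from confusion-matrix counts, the Python below uses the textbook definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the dissertation.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts.

    tp/fn count positive-class sequences predicted correctly/incorrectly;
    tn/fp count negative-class sequences predicted correctly/incorrectly.
    Assumes both classes are non-empty.
    """
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts chosen only for illustration.
sn, sp, mcc = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sn:.2%} specificity={sp:.2%} MCC={mcc:.3f}")
```

Unlike sensitivity and specificity, MCC uses all four counts at once, so it stays informative when the positive and negative sets differ in size.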
Keywords/Search Tags: Coding Potential, Coding RNA, Non-coding RNA (ncRNA), Long ncRNA (lncRNA), Circular RNA (circRNA), Small Open Reading Frames (sORFs)