Identification Of Circular RNAs Using Genomic Sequence Features

Posted on: 2019-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2370330590475419
Subject: biomedical engineering
Abstract/Summary:
Circular RNAs (circRNAs) are a novel class of RNAs with important biological functions. Currently, circRNAs are usually identified from high-throughput RNA-seq data using bioinformatics pipelines. Because these computational tools have high false-positive and false-negative rates, the circRNAs identified by different tools overlap poorly. To overcome this, we developed a method that identifies circRNAs from the genomic sequence alone. We present a model that distinguishes circRNAs from canonical linear RNAs using genomic sequence features, including the density of A-to-I RNA editing sites, the pairing score of Alu elements in the flanking introns, the distribution of binding sites of RNA-binding proteins (RBPs), and the length of the flanking introns, among others. These sequence features differed significantly between circRNAs and canonical linear RNAs. We implemented the model with two machine-learning algorithms, random forest (RF) and support vector machine (SVM). Our results showed that the selected features can effectively distinguish circRNAs, and that some sequence features contributed substantially to circRNA classification. We also compared the classification performance of different sets of genomic features.

The main results of this thesis are as follows:

(1) We investigated the role of RBPs in the regulation of circRNA biogenesis. We demonstrated that RBP binding sites are significantly enriched in regions near splice sites, including the first and last exons of the transcript and the flanking introns. GO enrichment analysis suggested that the RBPs whose enriched binding sites lie in the flanking regions are more relevant to RNA splicing. Compared with linear RNAs, circRNAs had significantly different distributions of RBP binding sites in the regions flanking splice sites.

(2) We implemented a bioinformatics pipeline to analyze several genomic features related to RNA circularization, including the density of A-to-I RNA editing sites, the binding sites of RBPs, the pairing score of Alu elements in the flanking introns, and sequence composition, among others. We showed that the selected features differed significantly between circRNAs and linear RNAs.

(3) Using these genomic features and machine-learning algorithms (SVM and RF), we built a model to classify circRNAs and linear mRNAs. We confirmed the superior performance of our genomic feature-based model. We ranked the genomic features by their contribution to classification and confirmed that the top-ranked features can classify circRNAs effectively. Furthermore, we compared our genomic feature-based model with a model based on thermodynamic features and observed that ours performed better at identifying circRNAs.
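
As an illustration of the kind of flanking-intron features the abstract describes, the following is a minimal sketch, not the thesis's actual pipeline. It assumes that editing sites are given as sorted genomic positions on one chromosome, that introns are half-open intervals `(start, end)`, and that density is expressed as sites per kilobase. The function names `site_density` and `flank_features` and all coordinates are hypothetical.

```python
from bisect import bisect_left


def site_density(sites, start, end):
    """Return sites per kilobase in the half-open interval [start, end).

    `sites` must be a sorted list of positions (assumption: one chromosome).
    """
    length = end - start
    if length <= 0:
        return 0.0
    # Count positions p with start <= p < end using binary search.
    count = bisect_left(sites, end) - bisect_left(sites, start)
    return 1000.0 * count / length


def flank_features(sites, upstream_intron, downstream_intron):
    """Build a small feature dictionary for one candidate back-splice event."""
    up_start, up_end = upstream_intron
    down_start, down_end = downstream_intron
    return {
        "upstream_intron_length": up_end - up_start,
        "downstream_intron_length": down_end - down_start,
        "upstream_editing_density": site_density(sites, up_start, up_end),
        "downstream_editing_density": site_density(sites, down_start, down_end),
    }


# Hypothetical toy data: five A-to-I editing sites and two flanking introns.
editing_sites = sorted([1200, 1500, 1950, 5100, 9000])
features = flank_features(editing_sites, (1000, 2000), (5000, 7000))
print(features)
```

Feature dictionaries of this shape, one per candidate, could then be assembled into a matrix and passed to an RF or SVM classifier; how the thesis actually encodes and scales its features is not stated in the abstract.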
Keywords/Search Tags: Circular RNAs, Sequence Feature, Machine Learning, Alternative Splicing, RF, SVM