
Identifying Protein-coding And Long Non-coding RNA In Context Of High-throughput Sequencing

Posted on: 2014-12-30  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Sun  Full Text: PDF
GTID: 1260330425465098  Subject: Computer software and theory
Abstract/Summary:
With the widespread application of high-throughput sequencing technology, numerous novel transcripts have been identified in many species, including humans, mice, rats, E. coli, C. elegans, Drosophila, Arabidopsis and other model organisms. In particular, a large number of long non-coding RNAs have been found in humans and mice, and some of them have been confirmed to participate in many important biological processes, such as cell differentiation, immune response, signaling pathways and metabolic regulation pathways. Exploring the functions of long non-coding RNAs and their regulatory networks has become a hot research topic. In many species, however, a large number of long non-coding RNAs have not yet been identified. Identifying the sequence differences between protein-coding and non-coding transcripts is therefore an urgent task. To complete this work, two difficulties must be overcome. First, many species lack a complete genome annotation, especially for long non-coding RNAs; even in human and mouse, only a small fraction of long non-coding RNAs have been identified. Second, high-throughput sequencing has an inherent error rate: in RNA-Seq experiments, some bases are called incorrectly with a certain probability. Combined with the incomplete transcripts that may be produced during transcript reconstruction, these issues distort the transcripts obtained from high-throughput sequencing.
The incomplete annotation and transcript quality problems described above make this transcript classification problem more challenging.

To overcome these challenges, we developed, for the first time, the Coding-Non-Coding Index (CNCI) software, a powerful signature tool that profiles adjoining nucleotide triplets (ANT) to effectively distinguish protein-coding from non-coding sequences independently of known annotations. Our finding is consistent with previous observations that coding domain sequence (CDS) regions have been under a variety of competing selection pressures, especially a translation optimization force that may be associated with the juxtaposition of tRNAs and that is not required in non-coding regions. In addition, the biased usage frequency of adjoining nucleotide triplets described in this work is supported by other researchers, whose studies have shown that the positions tRNAs take on the ribosome follow a preference and that two tRNAs tend to act as a pair. In accordance with the central dogma, a tRNA binds a codon in the ribosome so that the codon is translated into an amino acid. Over a long evolutionary process the genome is subject to a variety of selection pressures, and CDS regions in particular evolve under pressure toward optimal encoding. Non-protein-coding regions, unlike the CDS, do not need to withstand this pressure. Based on this significant feature, we used datasets containing known protein-coding and non-coding RNAs to calculate the ANT score matrix, and analyzed each transcript with a sliding window of a set size and a scan step of one ANT. The window scanned each transcript six times to generate six reading frames. During this scanning process, CNCI calculated the S-score of each window based on the ANT score matrix; thus a given transcript produces six discrete numerical arrays.
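The scanning step just described can be illustrated with a minimal Python sketch. It is not the CNCI implementation, and several details are assumptions that the abstract does not state: the six reading frames are taken to be three offsets on the forward strand plus three on the reverse complement, the score matrix is modelled as a dictionary keyed by the 6-nucleotide string of two adjoining triplets, the S-score of a window is taken to be the plain sum of its ANT scores, and unknown ANTs score 0.0. The function names and the toy matrix are hypothetical.

```python
from typing import Dict, List

# Translation table for complementing DNA bases (assumes an A/C/G/T alphabet).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def ant_scores(seq: str, frame: int, matrix: Dict[str, float]) -> List[float]:
    """Score every pair of adjoining triplets in one frame, stepping one triplet."""
    # Only complete triplets are kept; a trailing 1-2 nt remainder is ignored.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Assumption: ANTs missing from the matrix contribute a neutral 0.0.
    return [matrix.get(a + b, 0.0) for a, b in zip(codons, codons[1:])]


def window_scores(scores: List[float], window: int) -> List[float]:
    """Sum ANT scores inside a sliding window that moves one ANT at a time."""
    if not scores:
        return []
    if len(scores) < window:
        # Assumption: a frame shorter than the window yields one partial window.
        return [sum(scores)]
    return [sum(scores[i:i + window]) for i in range(len(scores) - window + 1)]


def six_frame_arrays(seq: str, matrix: Dict[str, float], window: int = 3) -> List[List[float]]:
    """Return six arrays of window scores: frames 0-2 forward, then 0-2 reverse."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [
        window_scores(ant_scores(strand, frame, matrix), window)
        for strand in strands
        for frame in range(3)
    ]


if __name__ == "__main__":
    # Toy score matrix with made-up values, for illustration only.
    toy_matrix = {"ATGGCC": 1.0, "GCCAAA": 0.5}
    for array in six_frame_arrays("ATGGCCAAA", toy_matrix, window=2):
        print(array)
```

In a real setting the matrix would be estimated from known coding and non-coding transcripts, as the abstract describes; the toy values here only show the shape of the data.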
After that, we applied a dynamic programming procedure called Maximum Interval Sum to identify a candidate MLCDS sequence in each reading frame.

CNCI is particularly well suited to the transcriptome analysis of poorly studied species because it can effectively classify transcripts based solely on the nucleotide composition of their sequences. It differs from previous methods, which depend on ORF information or known annotation (such as peptide databases or multispecies nucleotide sequence alignments) to find conserved regions. CNCI therefore has a key advantage over other methods, since genome sequences have been well annotated or completely sequenced for only a limited number of species so far; for most species, only part or even none of the whole genome sequence is known. For this large number of species with poorly annotated sequences, it is impossible to use peptide hits or multispecies alignments to classify sequences into protein-coding or non-coding transcripts, and different ORF cutoffs may lead to high false negative or false positive rates, especially for long non-coding RNAs. We tested CNCI on a published RNA-Seq dataset from six organs of orangutan. As a result, CNCI annotated 7,697 novel transcripts as long non-coding RNAs, which contributed to the first comprehensive orangutan long non-coding RNA catalog.
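The Maximum Interval Sum step mentioned in the abstract can be sketched as follows, assuming it corresponds to the classic maximum-sum contiguous sub-array problem applied to one frame's score array. The sketch uses Kadane's algorithm, which is a standard way to solve that problem and not necessarily the exact procedure used in CNCI. The function name and the example scores are hypothetical, and the abstract does not say how the final MLCDS is chosen among the six frames, so that part is left out.

```python
from typing import Sequence, Tuple


def max_interval_sum(scores: Sequence[float]) -> Tuple[float, int, int]:
    """Find the non-empty contiguous interval with the largest sum.

    Returns (best_sum, start, end) with `end` exclusive, so the interval
    is scores[start:end]. Runs in O(n) time using Kadane's algorithm.
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    best_sum = cur_sum = scores[0]
    best_start, best_end = 0, 1
    cur_start = 0

    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running sum can only hurt; restart the interval here.
            cur_sum = scores[i]
            cur_start = i
        else:
            cur_sum += scores[i]
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i + 1

    return best_sum, best_start, best_end


if __name__ == "__main__":
    # Made-up window scores for one reading frame.
    frame_scores = [-1.0, 2.0, 3.0, -4.0, 1.0]
    total, start, end = max_interval_sum(frame_scores)
    print(total, frame_scores[start:end])
```

Under these assumptions, the returned interval marks the highest-scoring stretch of a reading frame, which would serve as that frame's candidate region.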
Keywords/Search Tags: Long non-coding RNA, coding index, classification, high-throughput sequencing, adjoining nucleotide triplets