
Method for lncRNA Prediction Based on Sequence-Structure Information

Posted on: 2015-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2180330464968789
Subject: Computer application technology
Abstract/Summary:
Next-generation deep transcriptome sequencing has identified thousands of novel transcripts, most of which do not encode proteins and were once regarded as "junk" genes. Ongoing life science research is gradually overturning that view. Long noncoding RNA (lncRNA), a class of noncoding RNA molecules longer than 200 nt, has become a hotspot of genome research. Although large numbers of lncRNAs have been found in a variety of biological pathways, their molecular mechanisms, which are diverse and complex, remain poorly understood. The emergence of this large-scale "hidden" transcriptome renews the demand for methods that can rapidly distinguish mRNAs from lncRNAs.
Traditional experimental technologies, such as microarrays, focus mainly on identifying protein-coding transcripts. Among current computational prediction methods, alignment-based strategies such as CPC (Coding-Potential Calculator) and PhyloCSF (Phylogenetic Codon Substitution Frequencies) rely on sequence conservation and on the quality of existing protein libraries, while machine learning strategies such as CPAT (Coding-Potential Assessment Tool) use only a few biological features chosen from the coding-potential perspective. However, some lncRNAs evolved from mRNAs and therefore show homology to known proteins, contain open reading frames (ORFs), or exhibit sequence or secondary-structure conservation, so they are easily misclassified. These typical biological features alone are therefore not sufficient to predict lncRNAs accurately.
From a sequence-structure point of view, however, the sequence-structure specificities of lncRNAs offer new features and ideas for prediction.
Here, building on well-established biological features (such as ORF and protein sequence similarity), we analyze and extract sequence-structure information. We then present a novel method for lncRNA prediction that integrates this prior knowledge with sequence-structure features as a new filtering criterion. The 95,105 human lncRNAs in NONCODE and the 40,730 human mRNAs in UCSC are used as the positive and negative sample sets, respectively. SVM (Support Vector Machine) and Naïve Bayes approaches are applied to build classification models that identify lncRNAs. Under cross-validation, the accuracy exceeds 96%. CPAT and CPC, which do not incorporate sequence-structure features, serve as comparison methods; the accuracy of the proposed method is nearly 6% and 30% higher than that of CPAT and CPC, respectively. These results indicate that sequence-structure features help improve the precision of lncRNA prediction. Finally, the feature set is optimized to lower the false negative rate, and the potential biological implications of the optimized features are analyzed.
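As a rough illustration of the ORF-type features mentioned in the abstract, the sketch below computes the longest open reading frame of a transcript and the fraction of the transcript it covers. It is not the thesis's pipeline: the function names `longest_orf_length` and `orf_features`, the choice to scan only the three forward-strand frames, and the definition of an ORF as ATG through the first in-frame stop codon are all assumptions made for this example.

```python
# Illustrative sketch only; the ORF definition and function names are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def longest_orf_length(seq):
    """Length in nt (stop codon included) of the longest forward-strand ORF."""
    # Accept RNA or DNA input by mapping U to T.
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently being read
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # look for the next ORF in this frame
    return best


def orf_features(seq):
    """Return two simple ORF features for one transcript sequence."""
    length = longest_orf_length(seq)
    coverage = length / len(seq) if seq else 0.0
    return {"orf_length": length, "orf_coverage": coverage}
```

Feature dictionaries of this kind could be combined with sequence-structure features and passed to a classifier such as an SVM or Naïve Bayes model; which features the thesis actually uses is not stated in the abstract.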
Keywords/Search Tags:Long noncoding RNA, Prediction, Biological characteristics, Sequence-structure features