Font Size: a A A

Studies On The Prediction Of Long Non-coding RNAs Based On Deep Neural Network

Posted on:2020-11-29Degree:MasterType:Thesis
Country:ChinaCandidate:S S LiuFull Text:PDF
GTID:2370330575493570Subject:Signal and Information Processing
Abstract/Summary:PDF Full Text Request
High-throughput transcriptome sequencing(RNA-seq)technology has been widely developed and used for discovering millions of novel transcripts,especially long non-coded RNAs(long non-coding RNAs,IncRNAs).Handful of research and study has implicated that among these transcripts,only 1%to 2%of genes can encode proteins,and the rest can function as non-coding RNAs(ncRNAs).The ncRNAs were referred as "dark matter or junk,before attracting biologists for their significant roles in biological progress.The IncRNAs are defined as molecules composed of more than 200 nucleotides that do not code for proteins,and are characterized by poor sequence conservation and low expression.They also participate in complicated biological function.In recent years,with the development of high-throughput RNA-seq sequencing technology,a great quantity of IncRNA transcripts have been discovered.However,due to the limitations of biotechnology,only few functional IncRNA have been identified.For example,Xist(X-inactive specific transcript),which dominates the process of X chromosome inactivation on mammal X chromosome.HOTAIR(HOX antisense intergenic RNA)governs epigenetic regulation by the mode of chromatin-modification and gene silencing regulation of gene2.Distinguishing the IncRNA from protein-coding transcripts or messenger RNAs(mRNA)accurately is fundamental for systematic characterization of the IncRNA.Traditional tools based on computional method for IncRNA prediction such as CPC(Coding-Potential Calculator),CNCI(Coding-Non-Coding Index),CPAT(Coding-Potential Assessment Tool),etc.have been released.The features selected by CPC including the length and quality of open reading frame(ORF),and support vector machine(Support Vector Machine,SVM)model is used to train the data.Though CPC has certain fault tolerance,the prediction is highly dependent on the accuracy of the protein library and sequence conservation.The characteristics of codons are used as training features for the CNCI classifier.Compared with other methods,CNCI is less effective and time-consuming.CPAT integrates the features of multiple species for logical regression(Logistic Regression,LR)model to achieve a better result.In view of the fact that the annotation of IncRNA is relatively consummate,this paper proposes a computional tool named IncRScan-DNN based on deep neural network(DNN)to classify the IncRNA and mRNA.Compared to other existing classifiers,DNN is a fast and accurate algorithm for classification.Several features including ?-mer information,transcript length,codon length(CDS_length),codon length ratio(CDS-percentage),score of txCdsPredict prediction(CDS_score)and standard deviation of stop codon counts(stop_codon_std)are used to improve the performance of the DNN model.Positive dastasets(lncRNA)comprised sequences and annotations are extracted from GENCODE and NONCODE databases,while negative datasets are downloaded from UCSC and zflncRNApedia.Tenfold cross-validation was used to train the model.Results manifest that our DNN model outperforms several popular methods,including CPC2,CPAT and CNCI,evaluated by metrics including sensitivity,specificity,accuracy,Matthews correlation coefficient(MCC)and area under receiver operating characteristic curve(AUC).In addition,we also provide specific prediction models for various species including human,mouse,rat,pig,chicken,zerafish,chimp and celegans respectively.The lncRScan-DNN proposed in this paper achieves a good performance with the use of integrated features and deep neural network algorithm,which can be used as a vital basis for lcnRNA analysis.
Keywords/Search Tags:genomic sequencing, long-noncoding RNA(IncRNA), protein-coding RNA, deep neural network(DNN)
PDF Full Text Request
Related items