| A growing body of research indicates that non-coding RNAs perform important biological functions in cells, including controlling chromosomal replication, processing and modifying RNA, inhibiting translation, and silencing mRNA. Unlike coding RNAs, non-coding RNAs do not encode proteins. They were previously regarded as "junk" and "dark matter", but a large number of studies have shown that many non-coding RNAs have important regulatory functions. Large-scale sequencing of human transcripts revealed that approximately 70% of the human genome is transcribed into non-coding RNA, while protein-coding sequences account for only 1.5% of the human genome. Depending on transcript length, non-coding RNAs can be broadly classified into short non-coding RNAs (e.g., miRNA, piRNA, siRNA, shRNA) and long non-coding RNAs (lncRNA). In addition, non-coding RNAs include a special class of circRNA molecules, which have a closed-loop structure. The regulatory networks in which non-coding RNA molecules participate can influence key physiological processes such as human development, evolution, genetic variation, and various diseases. In-depth study of non-coding RNA may therefore reveal an RNA-mediated regulatory network for the expression of genetic information, providing new ideas for the study of human physiological processes. At present, although biological experiments can accurately identify non-coding RNAs, they require demanding experimental conditions, and many non-coding RNAs are expressed at extremely low levels in samples, so experimental methods alone are often impractical. With the development of next-generation high-throughput sequencing technology, human RNA data sets have been sequenced one after another, and how to use bioinformatics methods effectively to identify non-coding RNAs from these data has become a research hotspot in RNomics. In this paper, we mainly study the prediction of three non-coding
RNAs: lncRNA, miRNA, and circRNA. Based on a comparison and analysis of machine learning algorithms, the main research work of this paper is as follows. (1) We study in depth the application of ensemble learning algorithms to non-coding RNA prediction, analyzing and comparing the principles and performance of various ensemble algorithms. For each of the three non-coding RNA prediction tasks, three ensemble algorithms are compared with multiple other machine learning algorithms. The three ensemble algorithms give the best prediction results and are therefore selected as the best models. (2) To further improve the prediction accuracy for the three non-coding RNAs, feature selection is added to the three ensemble algorithms from the previous experiments. Compared with using the original features, feature selection improves the prediction accuracy for the three non-coding RNAs. (3) Three types of RNA sequence features are extracted: open reading frame (ORF), base combination frequency, and k-mer. Using the out-of-bag (OOB) data of random forests to estimate the importance of each feature, we confirm that ORF and k-mer contribute the most to the prediction of lncRNA, that base combination frequency and k-mer contribute the most to the prediction of pre-miRNA, and that ORF and k-mer contribute the most to the prediction of circRNA. Although few types of features are extracted in this paper, these three types of RNA features effectively keep the prediction accuracy for the three non-coding RNAs at a high level, and they significantly improve classifier performance. |
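
To make the sequence features named in item (3) concrete, the following is a minimal sketch of how k-mer frequencies and an ORF-based feature might be computed from a single RNA sequence. The abstract does not give the exact feature definitions, so everything here is an assumption for illustration: the function names (`kmer_frequencies`, `longest_orf_length`, `sequence_features`), the choice of k = 3, scanning only the forward strand, and counting an ORF from AUG through the first in-frame stop codon.

```python
from itertools import product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def kmer_frequencies(seq, k=3):
    """Relative frequency of every possible k-mer over the alphabet ACGU.

    Hypothetical helper: windows containing other symbols (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    counts = {"".join(p): 0 for p in product(BASES, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}


def longest_orf_length(seq):
    """Length in nucleotides of the longest AUG..stop ORF (stop included).

    Assumption: only the three forward-strand reading frames are scanned.
    """
    seq = seq.upper().replace("T", "U")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "AUG":
                start = i  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # look for the next ORF in this frame
    return best


def sequence_features(seq, k=3):
    """Combine k-mer frequencies with two ORF features into one feature dict."""
    feats = kmer_frequencies(seq, k)
    orf = longest_orf_length(seq)
    feats["orf_length"] = float(orf)
    feats["orf_coverage"] = orf / len(seq) if seq else 0.0
    return feats
```

A feature dictionary built this way, one per sequence, could then be passed to any classifier, including a random forest whose out-of-bag samples are used to rank features. The ranking step itself is not shown here.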