Identification of Large Intergenic Non-coding RNA Based on Feature Selection

Posted on: 2019-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: W N Xu
Full Text: PDF
GTID: 2370330602968895
Subject: Mathematics
Abstract/Summary:
Only about 2% of the human genome is transcribed into protein-coding transcripts; the rest gives rise to non-protein-coding transcripts. Studies show that the more complex an organism's functions, the higher the proportion of non-protein-coding sequence. Non-coding RNA (ncRNA) plays a regulatory role in several basic human processes. By length, non-coding RNA is divided into two kinds: long non-coding RNA (lncRNA) and small non-coding RNA (such as microRNA). LncRNA is associated with many biological processes, such as chromatin remodeling, cell differentiation, and epigenetic regulation, and it participates in the regulation of a variety of complex human diseases. According to the position of the lncRNA relative to protein-coding genes in the genome, lncRNA is divided into four categories: (1) sense lncRNA, which overlaps exons of another transcript on the same strand; (2) antisense lncRNA, which is complementary to the sequence of another RNA (usually an mRNA); (3) intronic lncRNA, which is transcribed from an intron of a protein-coding gene; and (4) lincRNA, which is transcribed from the region between two genes. LincRNA is one of the most representative classes of lncRNA and is attracting a growing number of researchers, and many lincRNAs have been experimentally confirmed to be related to tumor cell regulation.

At present, the human genome is known to contain more than 12,000 lincRNAs. Although many lincRNAs have been discovered, identifying them accurately remains difficult. Existing methods for recognizing lincRNA fall roughly into two categories: (1) RNA-Seq-based methods, which identify and analyze lincRNAs through library preparation and transcriptome reconstruction, but are time-consuming and costly; and (2) machine-learning methods, which use a classifier to identify lincRNA from multidimensional features, but have low specificity. It is therefore necessary to construct bioinformatics models that identify lincRNA accurately by optimizing the features.

This paper aims to distinguish lincRNA from protein-coding transcripts (mRNA) by exploiting significant differences between them. First, a 264-dimensional mixed feature set is constructed by combining minimum free energy (MFE) and signal-to-noise ratio (SNR) features with 4-mer sequence features. Then the features are quantified: SNR and the other sequence features are computed with MATLAB programs, and MFE is computed with the RNAfold software. Finally, over-sampling and under-sampling are used to build a balanced experimental data set, on which a random forest (RF) classifier is trained. To demonstrate the superiority of the RF model, other classification models, such as the support vector machine (SVM) and the extreme learning machine (ELM), are also constructed. ROC curves are drawn for the models, and their effectiveness is compared by AUC (the area under the ROC curve). The results show that the AUC of the RF classification model is 0.922, indicating good robustness in the recognition process. Test results on the same data set show that the sensitivity, specificity, and accuracy of the new method are 94.1%, 93.2%, and 93.7%, respectively, so the RF classification model proposed in this paper can identify lincRNA efficiently.
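
As an illustration of two steps mentioned in the abstract, the following Python sketch computes normalized 4-mer frequencies for a transcript and derives sensitivity, specificity, and accuracy from predicted labels. It is not the author's implementation (the abstract states that MATLAB was used), and the function names, the RNA alphabet "ACGU", the skipping of ambiguous bases, and the choice of label 1 for lincRNA are all assumptions made for the example. If all 4^4 = 256 possible 4-mers are used, they would account for 256 of the 264 dimensions, but the abstract does not give that breakdown.

```python
from itertools import product

K = 4
BASES = "ACGU"  # assumed alphabet; DNA input is converted by mapping T to U

# All 4**4 = 256 possible 4-mers in a fixed order, so that every
# transcript is mapped onto the same feature layout.
KMERS = ["".join(p) for p in product(BASES, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return the 256 normalized 4-mer frequencies of a sequence (hypothetical helper)."""
    seq = seq.upper().replace("T", "U")
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip windows containing ambiguous bases such as N
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def sensitivity_specificity_accuracy(y_true, y_pred, positive=1):
    """Compute the three metrics from true and predicted labels.

    The positive class (assumed here to be lincRNA, label 1) determines
    which errors count as false negatives and which as false positives.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return sensitivity, specificity, accuracy
```

In a pipeline like the one described, the 4-mer vector would be concatenated with the MFE, SNR, and other sequence features before the classifier is trained, and the metric function would be applied to the classifier's predictions on the test set.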
Keywords/Search Tags: Long Intergenic Non-coding RNAs, Random Forest, Minimum Free Energy, Signal-to-noise Ratio, Machine Learning