
The Identification Of Non-coding RNA Based On Machine Learning Algorithm

Posted on: 2017-08-25  Degree: Doctor  Type: Dissertation
Country: China  Candidate: C Pia  Full Text: PDF
GTID: 1310330518980186  Subject: Bioinformatics
Abstract/Summary:
A growing body of research shows that non-coding RNAs (ncRNAs) are involved in a variety of important biological functions in the cell, including the control of chromosome dynamics, RNA splicing, RNA editing, translational inhibition and mRNA degradation. Moreover, an increasing number of studies show that many ncRNAs have crucial regulatory functions. Analysis of human transcriptomes indicates that about 70% of the human genome is transcribed into ncRNAs, while protein-coding transcripts (PCTs) account for only 2%-3% of the genome. According to transcript size, ncRNAs fall into two categories: short ncRNAs and long ncRNAs (lncRNAs). Short ncRNAs include small nucleolar RNAs (snoRNAs), microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), short-interfering RNAs (siRNAs) and short hairpin RNAs (shRNAs). This thesis mainly studies the identification of miRNAs, piRNAs and lncRNAs. We also explore the prediction of the relation between miRNAs and diseases, and obtain some innovative results. The thesis contains the following three parts:

(1) Use the extreme learning machine (ELM), a new machine learning algorithm, to study the recognition of miRNA precursors. MicroRNAs (miRNAs) are endogenous non-coding RNAs that play an important role in gene regulation. Through complete or incomplete pairing with their target genes, miRNAs degrade the targets or inhibit their expression, and thereby regulate biological processes such as growth, development and apoptosis. MiRNAs have been verified to be associated with various diseases such as cancers. Therefore, identifying miRNAs accurately is the first step in studying their function. In general, experimental methods are time-consuming and expensive. They are also ineffective for miRNAs whose expression is low or restricted to specific conditions.
In this thesis, we optimize the 32-dimensional local contiguous structure-sequence features (Triplet). Since pre-miRNAs have a stem-loop structure, recording features for every base along the whole chain causes information redundancy. To reduce the time complexity and the information redundancy, we propose a bidirectional sliding window (BSW) method to extract features, and obtain 8-dimensional optimized local contiguous structure-sequence features (OP-Triplet). Combining these with the minimum free energy (MFE) and structural diversity, we get a 10-dimensional feature vector. Compared with the 32-dimensional features, this 10-dimensional feature vector greatly reduces information redundancy and improves the accuracy and efficiency of the algorithm. We also introduce a novel machine learning algorithm called the extreme learning machine (ELM). The results indicate that our method is significantly more effective than the Triplet-SVM-classifier and the MiPred classifier. Furthermore, we compare the results produced by SVM, the Triplet-SVM-classifier, the MiPred classifier (RF) and our ELM method based on the optimal features. We conclude that both the ELM method and the optimal features contribute to the prediction accuracy.

(2) Use the integrated extreme learning machine algorithm to accurately identify human piRNAs. Piwi-interacting RNAs (piRNAs) are a novel class of small RNAs isolated from mammalian germline cells. PiRNAs are around 19-33 nucleotides long, mostly in the range of 26-33 nucleotides. By interacting with Piwi proteins, piRNAs form a ribonucleoprotein complex called the Piwi-interacting RNA complex (piRC), which has been extracted and purified from rat testes. PiRNAs also protect the genome of animal germ cells from the action of transposable elements, which can cause DNA damage and sterility.
In addition, some recent research indicates that piRNAs may also play an important role in cancer. In this thesis, we propose a new method that uses a hybrid feature vector to identify human piRNAs. To do so, we propose a new set of 80-dimensional features called Short Sequence Motifs (SSM). A hybrid feature vector with 1444 dimensions is formed by combining 1364 k-mer string features with the 80 SSM features. However, not every feature contributes to the classification accuracy. Therefore, we optimize the 1444-dimensional feature vector with the feature score criterion (FSC): we calculate the FSC score of each feature and rank the features in descending order. The top 400 informative features are selected by experimental validation and used as the input feature vector of the V-ELM classifier. This effectively reduces the information redundancy caused by uninformative features and the complexity of training. Meanwhile, we also introduce a novel machine learning algorithm called the voting-based extreme learning machine (V-ELM). Using V-ELM, we can correctly predict samples that lie close to the classification boundary. The results show that our method is more effective than piRPred and piRNApredictor.

(3) Identify long non-coding RNAs (lncRNAs) based on the random forest algorithm. As a major part of eukaryotic transcriptomes, lncRNAs have been verified to be associated with various diseases such as cancers, heart failure and AIDS. The LncRNADisease database, constructed by Chen et al., contains more than 1000 lncRNA-disease entries, covering 321 lncRNAs and 221 diseases from nearly 500 publications. Therefore, the identification and annotation of lncRNAs are crucial for understanding various regulatory mechanisms. In this thesis, we introduce three new features: MaxORF, RMaxORF and SNR.
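The abstract names MaxORF without defining it. As a rough illustration only, the sketch below assumes MaxORF means the length of the longest open reading frame (an ATG start codon followed in frame by a stop codon) over the three forward reading frames. The function name `max_orf_length` and this definition are assumptions made for the example, not the thesis's stated method; RMaxORF and SNR are not sketched because the abstract gives no hint of their definitions.

```python
# Hypothetical sketch of an ORF-length feature; the definition is an assumption.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def max_orf_length(seq: str) -> int:
    """Length in nucleotides (start and stop codons included) of the longest
    ATG...stop ORF found in the three forward reading frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # look for the next ORF in this frame
    return best
```

Under this assumed definition, a transcript whose longest ORF is short relative to its length would look less likely to be protein-coding, which is the intuition behind using ORF length as a feature for separating lncRNAs from coding transcripts.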
A new hybrid feature vector with 89 dimensions is formed by combining 86 sequence features with the 3 new features (MaxORF, RMaxORF and SNR). However, not every feature contributes to the classification accuracy, so we optimize the 89-dimensional features using the feature score criterion (FSC). The top 30 features ranked by FSC are selected as the input vector of the classifier. In addition, a random forest (RF) classifier model is constructed to discover new lncRNAs. Robustness is an advantage of the RF model, since it builds an ensemble of trees by randomly selecting features. The accuracy of an RF classifier depends heavily on the selection of training samples. To choose representative samples, we use a Self-Organizing Feature Map (SOM) to select the training dataset. Finally, we provide a highly reliable and accurate tool called LncRNApred. It can identify lncRNAs from thousands of assembled transcripts accurately and quickly. Moreover, LncRNApred can also predict the protein-coding potential of transcripts. The results indicate that LncRNApred outperforms CPC. Therefore, we believe that LncRNApred is a valuable tool for the study of lncRNAs and protein-coding transcripts.
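To make the piRNA pipeline of part (2) more concrete, here is a minimal sketch of two of its ingredients: k-mer frequency features and a majority-voting ensemble of extreme learning machines. It rests on assumptions the abstract does not confirm. The count of 1364 k-mer features is consistent with using all k-mers for k = 1 to 5 (4 + 16 + 64 + 256 + 1024 = 1364), so that range is assumed here. The ELM is the textbook form (random, fixed hidden layer; output weights solved by pseudoinverse), and the voting scheme is a plain majority vote. All class names, function names and hyperparameters are hypothetical.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"
# Assumption: all k-mers for k = 1..5, i.e. 4 + 16 + 64 + 256 + 1024 = 1364.
KMERS = ["".join(p) for k in range(1, 6) for p in product(ALPHABET, repeat=k)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_features(seq):
    """Relative frequency of every k-mer (k = 1..5) in seq."""
    seq = seq.upper().replace("U", "T")
    vec = np.zeros(len(KMERS))
    for k in range(1, 6):
        n_windows = len(seq) - k + 1
        for i in range(max(n_windows, 0)):
            idx = KMER_INDEX.get(seq[i:i + k])
            if idx is not None:  # skip windows with N or other symbols
                vec[idx] += 1.0 / n_windows
    return vec


class ELM:
    """Single-hidden-layer ELM for binary labels in {0, 1}."""

    def __init__(self, n_hidden, rng):
        self.n_hidden = n_hidden
        self.rng = rng

    def _hidden(self, X):
        # Sigmoid activations of the random hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        # Hidden weights are drawn at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        targets = np.where(np.asarray(y) == 1, 1.0, -1.0)
        # Output weights: least-squares solution via the pseudoinverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta > 0).astype(int)


class VotingELM:
    """Majority vote over independently initialised ELMs."""

    def __init__(self, n_models=7, n_hidden=20, seed=0):
        rng = np.random.default_rng(seed)
        self.models = [ELM(n_hidden, rng) for _ in range(n_models)]

    def fit(self, X, y):
        for model in self.models:
            model.fit(X, y)
        return self

    def predict(self, X):
        votes = np.sum([m.predict(X) for m in self.models], axis=0)
        return (2 * votes > len(self.models)).astype(int)
```

Because each ELM draws different random hidden weights, individual models can disagree on samples near the decision boundary, and the vote smooths out those disagreements. The FSC ranking step that reduces 1444 features to 400 is left out, since the abstract does not define how the score is computed.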
Keywords/Search Tags:Support vector machine, Extreme learning machine, MiRNA, PiRNA, LncRNA