MicroRNA Precursor Identification Based On The Secondary Structure

Posted on: 2016-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Fang
GTID: 2310330503986909
Subject: Computer Science and Technology
Abstract/Summary:
MicroRNAs (miRNAs) are non-coding RNA molecules, about 20-24 nucleotides long, with regulatory functions. They are mainly involved in the transcriptional and post-transcriptional regulation of gene expression. Aberrant miRNA expression has been implicated in numerous disease states, but the underlying mechanisms are still unclear. Identifying miRNAs is therefore fundamentally important both for basic biological research and for miRNA-related therapeutic strategies. In the post-genomic era, with the rapid growth in the number of RNA sequences, the demand for computational methods that identify miRNAs from sequence information has become more urgent. This research studies the miRNA identification problem in depth. It extracts features mainly from the secondary structure of miRNA sequences and combines machine learning with natural language processing techniques to build predictive models. The specific contributions are as follows.

First, the research proposes the concept of pseudo secondary structure status composition (Pseudo Structure Status Composition, PseSSC) and applies it to miRNA identification. Existing methods consider only the primary sequence information of RNA. To address this shortcoming, the improved scheme extracts features from both the secondary structure sequence information and the global sequence information of RNA. With this feature extraction method, RNA sequences are transformed into feature vectors, and a support vector machine is then used to build a classifier that identifies miRNAs. The prediction accuracy on the benchmark dataset is 85.76%, which outperforms the state-of-the-art method in this field.

Second, the research proposes miRNA-dis, a method based on "secondary structure status distance pairs".
This method improves on PseSSC, which cannot fully characterize miRNA sequences because it ignores the distance between the secondary structure statuses that form a pair. Experimental results on the benchmark dataset showed that miRNA-dis outperforms the other compared methods in both prediction accuracy (88.92%) and computational efficiency. In addition, feature weight analysis shows that miRNA-dis portrays the secondary structure of miRNA sequences effectively. To take full advantage of the global sequence information of the secondary structure while also considering the characteristics of secondary structure status distance pairs, the research also proposes a predictive method based on "pseudo secondary structure status distance pair composition" (PseDPC). Experimental results on the benchmark dataset showed that the accuracy of PseDPC exceeds that of PseSSC by 1.93 percentage points, since the latter takes only the global secondary structure sequence information into account.

Third, the research proposes a predictive method based on "Gapped n-tuple secondary Structure Status Composition" (GSSC). This method introduces the concept of a "gap" to address the vector sparsity that arises in the traditional n-gram method when a larger n is used to capture more global information. This eases the problem that n-gram-based approaches are too sensitive to noisy data when n is large. To improve computational efficiency on large-scale data, the research designs an optimized kernel that uses a tree data structure and a series of approximation strategies.
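
The gapped n-tuple idea can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the secondary structure is given as a dot-bracket string (as produced by common RNA folding tools), treats each character as a "structure status", and models a gap as a wildcard position in a fixed-length window, in the style of gapped k-mers. The function name `gapped_ntuple_counts` and its parameters are hypothetical, and the optimized tree-based kernel mentioned in the abstract is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def gapped_ntuple_counts(status, length, informative):
    """Count gapped n-tuples in a structure-status string.

    status      -- structure-status string (assumed here: dot-bracket characters)
    length      -- window length, including gap positions
    informative -- number of positions kept per window; the rest become "*"
    """
    if not 0 < informative <= length:
        raise ValueError("informative must be between 1 and length")
    counts = Counter()
    for start in range(len(status) - length + 1):
        window = status[start:start + length]
        # Enumerate every way of keeping `informative` positions in the window.
        for keep in combinations(range(length), informative):
            kept = set(keep)
            key = "".join(ch if i in kept else "*" for i, ch in enumerate(window))
            counts[key] += 1
    return counts


# Toy hairpin-like structure; each window of 3 contributes 3 gapped tuples.
features = gapped_ntuple_counts("((..))", length=3, informative=2)
```

Because different windows can map to the same gapped tuple (for example, `((.` and `(..` both yield `(*.`), the resulting count vector tends to be less sparse than a plain n-gram count over the same window length.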
Experimental results on the benchmark dataset showed that the GSSC method outperforms the state-of-the-art method, with a prediction accuracy of 86.91% and an AUC of 0.941. It also outperforms PseSSC, which considers only the global sequence information of the RNA secondary structure, by 1.15 percentage points in prediction accuracy.

Fourth, the research proposes an ensemble learning method that uses a weighted voting strategy. It combines the predictions of the four methods to output an overall decision on whether the input sequence is a miRNA. To assess whether the four methods are suitable as base classifiers, the research examines their complementarity experimentally. Experimental results on the benchmark dataset showed that the ensemble method predicts miRNAs significantly better than each of the four individual classifiers. To verify the stability of the model's prediction performance and its usability across species, the research also ran experiments on an independent test set and on a cross-species dataset, and all of these experiments achieved good results.
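
The weighted voting step can be sketched as follows. This is an illustration under stated assumptions, not the thesis's procedure: the abstract does not say how the weights are chosen, so the example simply reuses the reported accuracies as weights, and it presumes the four base classifiers are PseSSC, miRNA-dis, PseDPC, and GSSC. The function name `weighted_vote` and the 0.5 threshold are hypothetical.

```python
def weighted_vote(predictions, weights, threshold=0.5):
    """Combine binary predictions (1 = miRNA, 0 = not) by weighted voting.

    Returns (score, label), where score is the weighted fraction of
    positive votes and label is 1 when score >= threshold.
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w * p for p, w in zip(predictions, weights)) / total
    return score, int(score >= threshold)


# Assumed weights: accuracies reported in the abstract, used only for illustration.
# 0.8769 for PseDPC is derived from 85.76% + 1.93 percentage points.
weights = [0.8576, 0.8892, 0.8769, 0.8691]  # PseSSC, miRNA-dis, PseDPC, GSSC
votes = [1, 0, 1, 1]                         # hypothetical base-classifier outputs
score, label = weighted_vote(votes, weights)
```

With accuracy-derived weights that are this close to one another, the scheme behaves almost like a simple majority vote. Weights tuned on validation data would let the ensemble favour the base classifiers that complement each other best.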
Keywords/Search Tags: microRNA identification, pseudo secondary structure status composition, secondary structure status distance pair, gapped n-tuple secondary structure status composition, support vector machine