The Prediction And Assessment Of RNA Secondary Structure Using Comparative Sequence Analysis

Posted on:2008-05-29

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Y Fang

GTID:1100360278456525

Subject:Computer Science and Technology

Abstract/Summary:

As more and more non-coding RNAs (ncRNAs) are discovered and identified, it has become clear that ncRNAs, like proteins, are important functional molecules. Predicting RNA secondary structure is the essential foundation for identifying and understanding ncRNAs, so methods for predicting RNA secondary structure are of great scientific importance.

The best and most widely used methods for predicting RNA secondary structure are all based on comparative sequence analysis. In these methods, the input is a set of RNA sequences or an alignment of multiple RNA sequences, and the goal is to compute the optimal secondary structure common to all sequences. However, five difficult problems remain in these methods:

(1) How can the computational complexity of the algorithm be reduced without lowering prediction accuracy?
(2) How can biological knowledge or heuristic procedures be used to devise prediction methods?
(3) How can high-quality, high-precision structural alignments of multiple RNA sequences be constructed to improve prediction accuracy?
(4) How can more detailed reference information, such as evolutionary information, be introduced to improve the prediction of RNA secondary structure?
(5) How can highly precise and credible predictions be obtained by assessing the predicted secondary structures?

In this dissertation, we study these problems in depth, design and implement corresponding solutions, and test and evaluate the proposed algorithms on corresponding data sets. The major contents and innovations of this work are as follows.

(1) The theory of the position matrix and position vector. The position matrix presented in this research is a special n×n matrix, where n is the length of the RNA sequence or RNA alignment. There are two kinds of position matrices: one for a single RNA sequence and one for an alignment of multiple RNA sequences. In the former, the elements are 0, 1 and -1, and regions of continuous base pairs (i.e. stems) in the RNA sequence can be identified conveniently and exactly by detecting runs of consecutive non-zero elements in the rows of the matrix. In the latter, the elements are 0 and 1, and regions of conserved continuous base pairs (i.e. conserved stems) in the RNA alignment can be identified conveniently and exactly by detecting runs of consecutive 1s in the rows of the matrix. The position vector presented here is a special n-dimensional vector, where n is the length of the RNA sequence or alignment. There are likewise two kinds of position vectors: one for a single RNA sequence and one for an alignment of multiple RNA sequences. The position matrix records all possible foldings of the RNA sequence or alignment, while the position vector records the detailed secondary structure of the sequence or alignment under a particular folding. Theoretical analysis and experimental results show that this theory can be applied efficiently to corresponding problems in RNA secondary structure prediction.

(2) The method for assessing RNA secondary structure using Signal-to-Noise. Stems are the basic building blocks of RNA secondary structure. Taking stems as the objects to be modeled, this dissertation proposes different assessment algorithms for different problems. They fall into two kinds: algorithms for assessing stems in a single RNA sequence and algorithms for assessing conserved stems in an RNA alignment. In the former, the Signal-to-Noise is computed from the base pairs in the stem; in the latter, it is computed from the so-called column pairs in the conserved stem. Experimental results show that both kinds of algorithms improve the methods for solving the corresponding problems.

(3) The method for detecting and assessing RNA secondary structure using multiple sequence alignment. The key to identifying an ncRNA is detecting its secondary structure. Taking an RNA alignment as input, we combine comparative sequence analysis, the theory of the position matrix and position vector, and the Signal-to-Noise method to devise an algorithm that detects and assesses RNA secondary structure by detecting and assessing conserved stems. Theoretical analysis and experimental results show that our method performs better than both QRNA and ddbRNA, two currently popular methods for predicting RNA secondary structure. Compared with QRNA, our method has lower computational complexity and higher sensitivity, and can be applied to RNA alignments of more than two sequences. Compared with ddbRNA, our method has higher sensitivity and specificity, and can be applied to gapped RNA alignments.

(4) The method for RNA secondary structure prediction using the position matrix and position vector. This is a direct application of the theory of the position matrix and position vector to RNA secondary structure prediction. First, a heuristic method for predicting RNA secondary structure is proposed based on the "seed-expanded" idea. Second, a combined method for predicting RNA secondary structure is proposed based on the detection and assessment of conserved stems. Each method is implemented as two algorithms, one for each kind of input (an alignment of multiple RNA sequences or a set of unaligned RNA sequences), and each algorithm is tested and its performance analyzed. The experimental results suggest that both methods perform better than RNAalifold when the input is an RNA alignment, and better than Mfold when the input is a set of unaligned RNA sequences.

(5) The method for constructing structural alignments of RNA sequences using the position matrix and position vector. The key to RNA secondary structure prediction by comparative sequence analysis is constructing a high-quality structural alignment of the RNA sequences. In this research, a new method for building structural alignments of RNA sequences is proposed. It is based on the detection and assessment of conserved stems, uses the theory of the position matrix and position vector and the Signal-to-Noise as its basic approaches, adopts the "seed-expanded" idea as its basic strategy, and takes a set of unaligned RNA sequences as input. The thesis first introduces the problem of structural alignment of RNA sequences, then describes the new method for constructing high-precision structural alignments of multiple RNA sequences in detail, and finally presents the testing and analysis of the method. The experimental results show that our method performs far better than Clustal W, a currently popular method for multiple sequence alignment.

(6) The method for predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. Evolutionary information is an important reference in the analysis of biological sequences. In this research, a new method for predicting RNA secondary structure based on Profile SCFG and phylogenic analysis is presented, which integrates richer evolutionary information from homologous sequences into the prediction of secondary structure. First, a new Profile SCFG is defined to model an RNA alignment and its consensus secondary structure. Then, two different HMMs are defined to model the structural and non-structural regions of the RNA sequences, respectively. Finally, a new probabilistic model for computing the optimal consensus secondary structure is proposed by integrating the HMMs into the Profile SCFG. The proposed method and Pfold are both tested on data sets built from the Rfam database. Experimental results show that our method performs better than Pfold, especially when the input alignment contains more sequences and more gaps and has lower sequence conservation.
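
The position-matrix idea for a single RNA sequence, described in item (1), can be illustrated with a small sketch. The abstract does not give the matrix layout or say what distinguishes 1 from -1, so everything below is an assumption made for illustration and not the construction used in the thesis: the sketch uses the standard base-pairing matrix (entry [i][j] describes bases i and j), marks Watson-Crick pairs as 1 and G-U wobble pairs as -1, requires at least three unpaired bases in a hairpin loop, and finds stems as runs of non-zero entries along anti-diagonals rather than along rows. The names `pairing_matrix`, `find_stems` and `MIN_LOOP` are hypothetical.

```python
# Sketch only: the layout and the meaning of 1 / -1 are assumptions,
# not definitions taken from the thesis.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in a hairpin loop


def pairing_matrix(seq):
    """Return an n x n matrix m: m[i][j] is 1 (canonical), -1 (wobble) or 0."""
    n = len(seq)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        # Only pairs that leave room for a hairpin loop are considered.
        for j in range(i + MIN_LOOP + 1, n):
            pair = (seq[i], seq[j])
            if pair in CANONICAL:
                m[i][j] = 1
            elif pair in WOBBLE:
                m[i][j] = -1
    return m


def find_stems(m, min_len=3):
    """Collect maximal runs of non-zero entries along anti-diagonals.

    Each stem is (i, j, length) and stands for the stacked pairs
    (i, j), (i + 1, j - 1), ..., (i + length - 1, j - length + 1).
    """
    n = len(m)
    stems = []
    for i in range(n):
        for j in range(n):
            if m[i][j] == 0:
                continue
            # Skip cells that continue a run which starts further out.
            if i > 0 and j + 1 < n and m[i - 1][j + 1] != 0:
                continue
            length = 0
            while (i + length < n and j - length >= 0
                   and m[i + length][j - length] != 0):
                length += 1
            if length >= min_len:
                stems.append((i, j, length))
    return stems
```

For the toy hairpin `GGGAAAACCC`, `find_stems(pairing_matrix("GGGAAAACCC"))` returns `[(0, 9, 3)]`, that is, one three-pair stem closing the `AAAA` loop. Scoring or filtering such candidate stems, which the thesis does with Signal-to-Noise, is outside this sketch.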

Keywords/Search Tags:

Non-coding RNA, RNA secondary structure, Position matrix, Position vector, Conserved stem, Seed-Expanded, Multiple sequence alignment, Stochastic context-free grammars, Phylogenic analysis

Related items

1	Research On Weighted Sequence Similarity Algorithm Based On K-MER Position Information
2	Research On Prediction Method Of Long Non-coding RNA Based On Position Weight Matrix
3	Study Of Several Algorithms For Alignment Problem Of Sequence And Sequence Secondary Structure
4	Identifying E. coli And Human Promoter Based On Sequence Information And Structure Information
5	The Applications Of Probabilistic Methods And Context-free Grammars In Permutation Statistics
6	Evolutionary Analysis Of Bacteria And Virus And Identification Of Eukaryote Coding Region Based On New Algorithm NAAKV
7	A Discussion On The Systematic Position Of The Genus Poona
8	The Machine Learning Model Of Protein Structural Prediction Based On Protein Sequence
9	Permutation Statistics And Chen’s Context-free Grammars
10	The Study Of Combinatorial Sequences By Using Grammars And Probabilistic Methods