Prediction of sRNAs in E. coli Using Transcriptional Termination Signals or Conservation

Posted on: 2012-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q Liu
GTID: 2214330371963009
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Bacterial sRNAs are an emerging class of regulatory small RNAs, 40 to 500 nt in length, that are widespread in bacteria. They are mainly located in intergenic regions, although some are found in the 5' or 3' untranslated regions (UTRs) of protein-coding genes. Unlike other non-coding RNAs such as tRNA and rRNA, sRNAs are heterogeneous in length, and sRNAs from the same bacterial genome share no conserved secondary structure.

Current studies show that sRNAs take part in a variety of biological processes in response to environmental changes by binding to their target mRNAs or proteins. These processes include plasmid replication, phage development, stress response, quorum sensing, bacterial virulence, and iron homeostasis. However, sRNA identification has been carried out on only a limited number of genomes, such as E. coli, among the more than 1000 sequenced genomes, so a large number of sRNA genes remain to be discovered. Research on sRNA identification is therefore important.

Genome-wide sRNA identification by experimental protocols alone has several disadvantages: it is time-consuming and labor-intensive, and some sRNAs are expressed only under specific conditions. Currently, a strategy that combines bioinformatics prediction with experimental confirmation is often applied to sRNA detection. It is therefore important to develop prediction models that can speed up sRNA identification. Moreover, the large number of completed genome sequences and the available RNA databases provide the basis for developing such models.

Compared with protein-coding genes, which have distinctive features, sRNA-encoding genes often lack well-defined features and are insensitive to frameshift or nonsense mutations.
Specific bioinformatics models are therefore needed. They fall into three categories: comparative genomics-based, transcription signal-based, and machine learning-based models.

Comparative genomics-based sRNA prediction assumes that potential sRNA genes are conserved in sequence and that the secondary structures of their transcripts are also conserved. Although these methods are widely used for sRNA identification, they have several limitations. First, they cannot find species-specific sRNA genes. Second, sequence information for closely related genomes must be available. Third, conserved intergenic regions may contain other types of gene structures, such as transcription factor binding sites (TFBSs) for undiscovered short proteins. Finally, sRNA genes cannot be found if they are located on the antisense strand of protein-coding genes.

Transcription signal-based prediction searches for potential promoters or TFBSs together with Rho-independent terminator structures. Its shortcomings are a high false positive rate, inherited from the high false positive rate of promoter or TFBS prediction, and the fact that it cannot predict sRNA genes that lack Rho-independent terminator structures.

Machine learning-based prediction assumes that sRNA gene sequences differ significantly from other parts of the genome. When sRNA prediction models are built with machine learning methods, the sequences are often divided into fragments of a fixed length, such as 50 nt.
However, because sRNA genes are characteristically heterogeneous in length, it is very difficult to choose the optimal sequence window.

To address these shortcomings and to better support the experimental identification of sRNA genes, we present two models for predicting sRNA genes in E. coli, based on transcriptional termination signals and on sequence conservation, respectively.

The transcriptional termination signal-based model assumes that specific sequence and structure patterns flank sRNA genes and that these patterns are not randomly distributed along the genome sequence. Careful inspection of known sRNA genes in E. coli showed that the pattern signal at the 5' end was weak, whereas the pattern signal at the 3' end was relatively strong. The 3' end pattern signal, described by a base frequency matrix, was therefore used to predict sRNA genes. Using a training dataset of 63 positive samples and 100000 randomly generated negative samples, we found an optimal cut-off value of 28.9524 for the sequence score calculated from the base frequency matrix. The sensitivity, specificity, and positive predictive value were 34.92%, 100.00%, and 100.00%, respectively, on the training dataset, and 4.30%, 99.99%, and 90.90% on a test dataset of 22 positive and 10000 negative samples.

The conservation-based model assumes that sRNA genes are conserved in closely related bacteria and that the Rho-independent terminator structures of sRNAs are also conserved. Under this assumption, we first searched for all conserved Rho-independent structures occurring at least twice among 39 genomes closely related to E. coli. The fragments upstream of these Rho-independent structures were then extracted and analyzed with the BLAST program.
Fragments were taken as potential parts of sRNA genes if they contained at least 20 nt continuously conserved in more than 7 of the 39 genomes. In total, we obtained 335 potential sRNA genes, with a sensitivity of 32.30% and a specificity of 94.40%. The model's specificity is comparable to that of sRNAPredict2, while its sensitivity is more than 12% higher than that of sRNAPredict2, indicating a substantial improvement in sRNA prediction from combining sequence conservation with Rho-independent terminator structure.
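
The two decision rules described above can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not say how the sequence score is derived from the base frequency matrix, so the log-odds sum against a uniform background, the pseudocount, and all function names here are assumptions made for illustration. The thesis cut-off of 28.9524 belongs to its own scoring scale and is therefore not reused. The confusion counts in the demo (22 true positives out of 63, no false positives) are inferred from the reported training percentages rather than stated in the abstract.

```python
import math
from collections import Counter

BASES = "ACGT"  # sequences are assumed to be uppercase DNA without ambiguity codes


def build_frequency_matrix(sequences, pseudocount=1.0):
    """Per-position base frequencies from aligned, equal-length 3'-end sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all training sequences must have the same length")
    total = len(sequences) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix


def score(sequence, matrix, background=0.25):
    """Sum of per-position log2-odds versus a uniform background (assumed scheme)."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence length must match the matrix length")
    return sum(math.log2(col[base] / background)
               for base, col in zip(sequence, matrix))


def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, positive predictive value)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, specificity, ppv


def passes_conservation_filter(longest_run_by_genome, min_run=20, min_genomes=8):
    """True if a conserved run of >= min_run nt is seen in more than 7 genomes.

    longest_run_by_genome maps a genome name to the longest continuously
    conserved run (in nt) found for the fragment, e.g. taken from BLAST hits.
    """
    supporting = sum(1 for run in longest_run_by_genome.values() if run >= min_run)
    return supporting >= min_genomes


if __name__ == "__main__":
    # Toy, made-up 3'-end sequences; real input would be known sRNA 3' ends.
    toy_matrix = build_frequency_matrix(["GCTTTTTT", "GGTTTTTT", "CCTTTTTA"])
    print(score("GCTTTTTT", toy_matrix), score("ACGACGAC", toy_matrix))
    # Counts inferred from the reported training figures (see lead-in).
    print(classification_metrics(tp=22, fn=41, tn=100000, fp=0))
```

A candidate 3' end would be called positive when its score exceeds a cut-off tuned on training data, and `classification_metrics` then summarizes the calls. With the inferred training counts it gives a sensitivity of 22/63, which rounds to the reported 34.92%.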
Keywords/Search Tags: bacterial sRNA, prediction, transcriptional termination signal, conservation