
Research On Distance Conservation Of Transcription Regulatory Motifs And Prediction Of Transcription Start Sites In Human Genome

Posted on: 2009-09-21  Degree: Doctor  Type: Dissertation
Country: China  Candidate: J Lv  Full Text: PDF
GTID: 1100360245987011  Subject: Theoretical Physics
Abstract/Summary:
Understanding the interaction network among transcription-regulation elements in human is an immediate challenge for modern molecular biology. A central problem is how to extract evolutionary information and identify evolutionary conservation by comparing the promoters of closely related species. Through comparative studies of k-mer distributions in human and mouse transcription factor binding site (TFBS) sequences, we discovered that the average distance between a pair of transcription regulatory 7-mer motifs is conserved between human and mouse promoters. This distance conservation is a new kind of evolutionary conservation that is not based on the strict location of bases in the genome sequence. The conservation of k-mer distance makes it possible to propose a non-alignment-based approach for fast genome-wide discovery of transcription regulatory motifs. We demonstrated the distance conservation by a genome-wide search for conserved regulatory 7-mer motifs, with a success rate of 90%. Then, after defining a human-mouse pair distance divergence parameter, we studied tissue-specific motif pairs and found that, for 28 tissues, the parameter for motif pairs is 11 to 16 times smaller than for their controls, and that these pairs can be clearly differentiated on a 2-dimensional parameter plane. Finally, the mechanism of distance conservation, which is supposed to be related to the module structure of TFBSs, is discussed briefly.

The accurate identification of promoter sequences and transcription start sites is a challenge in the construction of human transcription-regulation networks, and novel methods are needed to improve prediction. We used the method of Increment of Diversity with Quadratic Discriminant analysis (IDQD) to predict transcription start sites (TSSs). In typical TSS set prediction, both sensitivity and positive predictive value exceeded 65% at a positives/negatives ratio of 1:58.
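To make the reported figures concrete, the following minimal sketch works out what a sensitivity and positive predictive value of 65% imply at a positives/negatives ratio of 1:58. The function name `confusion_counts`, the choice of 100 positives, and the use of exactly 65% (the abstract reports values higher than 65%) are illustrative assumptions, not part of the dissertation.

```python
def confusion_counts(n_pos, neg_per_pos, sensitivity, ppv):
    """Derive expected confusion-matrix counts from sensitivity and PPV.

    Illustrative helper (not from the dissertation). Counts are expected
    values, so they may be fractional.
    """
    n_neg = n_pos * neg_per_pos
    tp = sensitivity * n_pos        # true positives recovered
    fn = n_pos - tp                 # positives that were missed
    fp = tp * (1.0 - ppv) / ppv     # from PPV = TP / (TP + FP)
    tn = n_neg - fp                 # negatives correctly rejected
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "false_positive_rate": fp / n_neg}


# Assumed example: 100 true TSSs and 58 negatives per positive.
counts = confusion_counts(n_pos=100, neg_per_pos=58, sensitivity=0.65, ppv=0.65)
print(counts)
```

Under these assumptions, about 65 of 100 true sites are recovered at the cost of about 35 false positives among 5800 negatives, a false positive rate of roughly 0.6%. This illustrates why positive predictive value, rather than specificity alone, is the informative figure when negatives greatly outnumber positives.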
Performance was evaluated using Receiver Operating Characteristic (ROC) and Precision-Recall Curves (PRC), giving an area under the ROC curve (auROC) higher than 96% and an area under the PRC (auPRC) of ≈26% for a positives/negatives ratio of 1:679 and 64% for a ratio of 1:113. In whole-genome searching, we made predictions on alternative-promoter-less and alternative-promoter-containing TSSs in chromosomes 4, 21 and 22 and obtained auROC = 93% and auPRC = 40% for a positives/negatives ratio of 1:138, and auROC = 97% and auPRC = 65% for a ratio of 1:68. The work shows that the IDQD method is capable of solving complicated classification problems in bioinformatics. The implementation of the IDQD algorithm, datasets and online-only supplementary data are available at the web site http://jichubu.imut.edu.cn/IDQD/idqd.htm.
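The abstract names the increment of diversity but does not define it. As a rough illustration only, the sketch below uses the definition commonly given in the literature, which is assumed here rather than taken from this dissertation: for a source X with category counts n_i and total N, the diversity is D(X) = N log N − Σ n_i log n_i, and the increment of diversity between two sources is ID(X, Y) = D(X + Y) − D(X) − D(Y). The function names, the use of k-mer counts as categories, and the toy sequences are hypothetical. The quadratic discriminant step of IDQD is not shown.

```python
from collections import Counter
from math import log


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def diversity(counts):
    """D(X) = N log N - sum(n_i log n_i), with 0 log 0 taken as 0."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * log(total) - sum(n * log(n) for n in counts.values() if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small when X and Y are similar."""
    return diversity(x + y) - diversity(x) - diversity(y)


# Toy sequences (assumed, for illustration only).
reference = kmer_counts("ACGTACGTACGT", 2)
similar = kmer_counts("ACGTACGTACGA", 2)
different = kmer_counts("GGGGCCCCGGGG", 2)

print(increment_of_diversity(reference, similar))
print(increment_of_diversity(reference, different))
```

With this definition, two sources with identical count profiles give an increment of zero, and the value grows as the profiles diverge, so a query sequence can be scored against a positive and a negative reference source and the resulting values used as features for a downstream classifier.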
Keywords/Search Tags: Distance conservation, Transcription regulatory motifs, Heptamer motifs, Tissue-specific motif pairs, Increment of diversity, Quadratic discriminant analysis, Transcription start site, Human promoters