Integrating Genome Sequence And Protein Structural Information To Predict DNA Binding Sites Of Transcription Factors

Posted on: 2021-05-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P P Long
GTID: 1360330605979013
Subject: Bioinformatics
Abstract/Summary:
Transcription factors (TFs) play important roles in the regulation of gene expression, and transcription factor binding sequences (TFBSs) are the basis of their regulatory specificity. Deciphering TFBSs is therefore the first step towards identifying the function of a transcription factor. In the post-genomic era, the amino acid sequences of a large number of transcription factors are available, and the rapid, accurate and large-scale identification of their binding sequences, together with further functional analysis based on amino acid sequences, has become a great challenge. Compared with the large number of transcription factors in sequenced genomes (a single transcription factor family contains several thousand members from different species), only a handful of transcription factors per family have been characterized experimentally. On the other hand, experimental analysis provides more abundant, more detailed and more definitive information than predictions based solely on amino acid sequences. Integrating detailed information from experimental assays with sequence information from a large number of family members has thus become an important task. In this work, we explored and verified a new approach that combines detailed information from a few TF-DNA complexes with genome sequences to predict the TFBSs of a large number of TFs. Specifically, we propose a method that integrates genome-sequence information, protein-DNA complex data and statistical learning to predict the DNA specificity of TetR family regulators (TFRs).

In the first part of this work, we streamlined a model based on genome sequences and proposed a quantitative indicator, the P-value, to measure the conservation of candidate TFBSs. The genome-sequence-based model borrows the idea of phylogenetic footprinting: it searches for conserved DNA sequences within upstream sequences and calculates the enrichment of candidate TFBSs. Using the streamlined genome model, we predicted TFBSs for all members of the TetR family and then retained the reliable predictions according to their P-values. However, the genome-sequence-based model relies heavily on the upstream sequence and on the number of homologous proteins, which limits its applicability and accuracy.

Machine learning has been applied to TFBS prediction in recent years. In principle, a machine learning model can be trained on the TFBSs reliably predicted by the genome-sequence model. However, such a training set requires that the amino acid sequences and the DNA sequences each be well aligned. Because the predicted TFBSs are short and divergent, they are hard to align on the basis of the TFBS sequences alone. In the second part, we propose a method that uses structural information from TF-DNA complexes to align the predicted TFBSs. First, the DNA sequences of the complexes are aligned based on their structures; these sequences are then used as seed DNAs to align TFBSs with a PSI-BLAST-like method. For each seed DNA, a group containing hundreds or thousands of TFBSs is obtained. Finally, the groups are merged according to the structural alignment of the seed DNAs, yielding a unified TF-DNA dataset with diverse sequences.

In the third part, a statistical energy function is trained on the aligned amino acid sequence-DNA sequence dataset. The model contains two energy terms: a single-site term reflecting the probabilistic preference at each amino acid (or nucleotide) position, and a pairwise term reflecting the interaction between two different positions. This energy function can be used to evaluate any TFR-DNA pair.

To assess the feasibility of the two methods, published experimental data for 20 TFRs and newly generated data for 10 TFRs were used to validate the models. The TFBSs of 29 of the 30 TFRs were correctly predicted by the genome-sequence-based model or the statistical energy method. The results demonstrate the reliability of the P-value indicator of the genome model and the Z-score indicator of the statistical energy model. Based on these indicators, we estimate that 59.6% of TFRs can be successfully covered by combining the two methods, whereas the genome-sequence-based method alone covers only 28.7% of the members of the TetR family. The binding profiles derived from the energy model were tested against high-throughput experimental assays: when the energy Z-scores and GSS Z-scores are low, the energy profiles are consistent with the binding profiles from the assays. Finally, we compared our predictions with the FootprintDB database and found that most of the predicted TFBSs could not be found in it, indicating that our method can discover many new TF-DNA interactions. This prediction approach can be extended to other transcription factor families with sufficient structural information.
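
The abstract describes the statistical energy function only in words (a single-site term plus a pairwise term, scored with a Z-score), so the following Python sketch is an illustration under stated assumptions rather than the dissertation's actual model. It assumes each training example is one aligned string made of the TF residues followed by the DNA bases, estimates the two terms from column and column-pair frequencies, and computes the Z-score of a TF-DNA pair against random DNA of the same length. The function names (train, site_freq, energy, z_score), the smoothing parameters (pseudo, alpha) and the toy alignment are all hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations


def train(alignment, alphabet_size=24, pseudo=1.0):
    """Count symbols per column and per column pair in aligned TF+DNA strings.

    alphabet_size=24 (20 amino acids + 4 bases) is a simplifying assumption.
    """
    n, length = len(alignment), len(alignment[0])
    single = [Counter(seq[i] for seq in alignment) for i in range(length)]
    pair = {(i, j): Counter((seq[i], seq[j]) for seq in alignment)
            for i, j in combinations(range(length), 2)}
    return {"n": n, "length": length, "single": single, "pair": pair,
            "q": alphabet_size, "pseudo": pseudo}


def site_freq(model, i, a):
    """Pseudocount-smoothed frequency of symbol a in column i."""
    return ((model["single"][i][a] + model["pseudo"])
            / (model["n"] + model["pseudo"] * model["q"]))


def energy(seq, model, alpha=0.1):
    """Single-site term plus pairwise term; lower energy means a better fit."""
    e = 0.0
    for i in range(model["length"]):
        e -= math.log(site_freq(model, i, seq[i]))
    for i, j in combinations(range(model["length"]), 2):
        joint = model["pair"][i, j][seq[i], seq[j]] / model["n"]
        indep = site_freq(model, i, seq[i]) * site_freq(model, j, seq[j])
        # Mix the observed joint frequency with the independent model so that
        # unseen pairs are penalised (ratio = alpha) instead of giving log(0).
        e -= math.log((1 - alpha) * joint / indep + alpha)
    return e


def z_score(protein, dna, model, n_random=200, seed=0):
    """Energy of the pair relative to random DNA of the same length."""
    rng = random.Random(seed)
    background = [
        energy(protein + "".join(rng.choice("ACGT") for _ in dna), model)
        for _ in range(n_random)
    ]
    mean = sum(background) / n_random
    sd = math.sqrt(sum((x - mean) ** 2 for x in background) / n_random)
    return (energy(protein + dna, model) - mean) / sd


# Toy alignment (invented data): residues "KR" pair with "ACGT", "EQ" with "TTAA".
toy_alignment = ["KRACGT"] * 5 + ["EQTTAA"] * 5
model = train(toy_alignment)
print(z_score("KR", "ACGT", model), z_score("KR", "TTAA", model))
```

In this sketch a lower (more negative) Z-score marks a TF-DNA pair that fits the training data better than random DNA does, which matches the abstract's statement that low Z-scores go with reliable predictions. A real model would be trained on the merged TF-DNA dataset described above and would need a more careful treatment of sequence weighting and regularisation.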
Keywords/Search Tags: Transcription