Font Size: a A A

Algorithm Research On The Problem Of Transcription Factor Binding Sites Identification

Posted on:2015-02-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y P ZhangFull Text:PDF
GTID:1260330431462425Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The first step of gene expression, transcription, is the main step of generegulation. Through transcription factors binding the specific-DNA sequences, geneexpression can be suppressed or enhanced. Identifying the binding regions of DNAsequences, that is, the identification of transcription factor binding sites, has the greatsignificance with understanding the transcriptional activity of genes andcomprehending gene expression, which is one of the most widely researches inbioinformatics now.The difficults of transcription factor binding sites identification problem are thatthe motif signal with length more than tens is relatively short with comparing to alarge number of the background noise sequences with length hundreds or thousands ofbases, and the motif instances of the same transcription factor may also partiallymutate. Meanwhile, with the length and the number of sequences increasing, the sizeof solution space will increase rapidly, and the computational cost will becomeimpractical. In addition, identifying multiple transcription factor binding sites in thebinding region, finding the specific co-regulated transcription factor binding sites anddiscovering the genome-wide binding sites, are the great challenges faced by thisproblem. In this paper, our researches are aimed at the mathematical models,optimization techniques, efficient identification methods and the further developmentscombining with the new biological experiments. The proposed methods are applied tothe the string simulate data, the promoter sequences of different species and tissuesand genome-wide DNA data to identify transcription factor binding sites.(1) We propose a novel fixed-position projection refinement algorithm in orderto solve the large candidate solution space of traditional transcription factorbinding sites identification problem and alleviate the classic methods to fallinto local optimal. This algorithm divides the data into different subsetsthrough a projection process based on the corresponding probabilisticfrequency matrix, it filters the subsets with certain information score andcomplexity score, which are used as the initial condition for expectationmaximum refinement. Our algorithm achieves the different motif instancesdistribution in the model OOPS, ZOOPS and TCM by setting the threshold inthe fixed-position projection process. Meanwhile, using the high orderMarkov model as background can make the probability model more closer to the reality. In addation, our algorithm can be extended to a multiple motifsdiscovery version by using the similarity function. Experimental results showthat the algorithm can effectively identify transcription factor binding sites inthe promoter sequences in multiple eukaryotic species.(2) For the planted (l, d) motif search problem derived from the transcriptionfactor binding sites identification, traditional algorithms are hard to make thegoog trade-off between efficiency and accuracy, and also hard to solve thechallenging instances. We present a heuristic cluster based EM algorithm,CEM, which refines the cluster subsets in the modified EM method toexplore the best local optimal solution through setting the reference sequence.Combining the exact methods and the probabilistic methods, CEM overcomethe shortcomings of the traditional EM algorithm falling into different localsolutions. Our algorithm can accurately find the planted sites and has betterperformance on the high degenerated motifs. Simulation data experimentsshow that CEM can not only accurately identify the planted motif signals inthe practical instances but also has a better accuracy in the challenginginstances. In addition, the real data experiments show that the algorithm canbe effectively applied to real species for identifying the transcription factorbinding sites.(3) For identifying the genome-wide transcription factor binding sites, wepropose an algorithm, MMFChip, to identify the binding motifs oftranscription factors in ChIP-seq data. Our algorithm is motivated on thecharacteristics that the amount and quality of ChIP-seq data have beendramatically increased. Through the word enumeration strategy of countingthe occurrences of the subsequences in the positive and negative set,MMFChip combines the exact methods and the probabilistic methods, whichselects the relative enriched subsequences and searches the (l, d) instances ofeach enriched subsequence to constitute the corresponding Position WeightMatrixes. Then MMFChip employs the statistic model including theintra-motif dependency and high-order Markov model to give the motif amore precise statistic description, and utilizes the False Discovery Rate tocontrol the quality of outputs. Finally, MMFChip uses a post-processing tocluster the similar motifs. Applying MMFChip on the ChIP-seq datasets, theresults show that our algorithm can process the motif discovery problem in the large-scale data, the mutiple binding motifs and co-factors can beeffectively detected.
Keywords/Search Tags:Transcription factor binding sites, motif discovery, expectationmaximization, chromatin immunoprecipitation sequencing
PDF Full Text Request
Related items