Study On Algorithms For Cis-regulatory Modules Discovery Based On Extended HMM

Posted on: 2018-10-06; Degree: Doctor; Type: Dissertation
Country: China; Candidate: H T Guo; Full Text: PDF
GTID: 1360330542492917; Subject: Computer application technology
Abstract/Summary:
Cis-regulatory module (CRM) discovery is an important problem in computational biology. It is the basis for understanding the molecular mechanism of gene transcription regulation and a key step in constructing gene regulatory networks. It is also important for studying the mechanisms of disease. CRM discovery remains challenging for several reasons. CRMs are widely distributed in the regulatory regions of regulated genes, and some lie hundreds of kilobase pairs from the target gene. Motif sites within CRMs are short and degenerate, and therefore difficult to identify. The complex structure of CRMs themselves further increases the difficulty. Their structural features include the number of motifs, the orientation of motifs, the distance between motifs, and the order of motifs within a CRM. However, the internal mechanism that governs the organization of these features is not fully understood, and homologous CRMs in different genes often undergo mutation and rearrangement. It is therefore difficult to characterize the regulatory structure of CRMs.

Many methods have been proposed to predict CRMs. According to the strategies they use, these methods fall into four categories: window clustering, probabilistic modelling, discriminative modelling, and phylogenetic footprinting. Of these, probabilistic modelling methods based on hidden Markov models (HMMs) are the most common and effective. This dissertation proposes new HMM-based CRM discovery algorithms that aim to improve prediction performance from three perspectives: increasing the expressive capacity of HMMs, reducing the model's search space, and avoiding over-fitting when modelling certain features. The specific contributions are summarized as follows.

In the first part, a probabilistic method called SMCis is proposed to address two important shortcomings of classic HMMs that limit the identification performance of HMM-based methods: 1) state durations are implicitly assumed to follow a geometric distribution; 2) observations are assumed to be mutually independent. SMCis uses a hidden semi-Markov model (HSMM) to build a CRM discovery model. Compared with general CRM discovery methods, SMCis takes into account the distance and sequence specificities of motifs within a CRM, rather than viewing a CRM as a simple cluster of motifs. Experimental results on three real biological datasets show that SMCis has better prediction performance.

In the second part, we note that, because of limited computational power, HMM-based CRM discovery methods are mostly used to identify promoters near the gene transcription initiation site, i.e., in short regulatory sequences. However, more general CRMs, such as enhancers, lie far from the transcription initiation site of the regulated gene. To identify such CRMs, discovery algorithms need to search large regulatory regions, i.e., long regulatory sequences. To this end, we propose a new CRM discovery method called Seg HMC, which constructs a segmental HMM for CRMs. To deal with the long regulatory regions of eukaryotic genes, we reduce the search space by segmenting the sequences before building the HMM and by removing a large number of unnecessary search paths. Seg HMC can be used to identify CRMs in the regulatory regions of target genes and even across the whole genome. Moreover, Seg HMC does not view a CRM merely as a combination of motifs; it introduces the frequency of motifs, the order preference, and the distance distribution between motifs into the CRM's regulatory grammar. These features can effectively improve the accuracy of CRM discovery. Experimental results on a simulated dataset and a real dataset show that Seg HMC performs better than the compared methods on long regulatory sequences.

In the third part, we observe that most CRM discovery algorithms consider correlations among all motifs when modelling inter-motif dependencies, which not only introduces a large number of parameters to be estimated but may also lead to model over-fitting. In view of this, we propose a CRM discovery algorithm called Com SPS. When modelling motif dependence, Com SPS considers only the correlation between motif pairs that frequently co-occur in the given sequences, thus significantly reducing the number of parameters to be estimated. Moreover, Com SPS makes full use of the given information and processes the data more systematically. Specifically, the input position weight matrices (PWMs) are first filtered according to their quality. Then, based on the filtered PWMs (or directly on the given PWMs), an HMM is constructed to model the regulatory structure of CRMs in the input sequences; the model is trained with the Baum-Welch algorithm, and the Viterbi algorithm is then applied to the trained model to infer the positions of potential CRMs in the sequences. Finally, the CRMs found are further screened and the conserved CRMs are output. Experimental results on three public benchmark datasets show that Com SPS performs better than the compared methods.
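
The first shortcoming of classic HMMs named in the abstract, the implicit geometric state duration, can be made concrete with a small sketch. It is not taken from the dissertation and is not the SMCis model: the function names, the self-transition probability 0.9, and the duration weights are all hypothetical, chosen only for illustration. A state with self-transition probability p is occupied for exactly d steps with probability p^(d-1)(1-p), so the most likely duration is always 1. An explicit duration distribution, as a semi-Markov state may use, can instead peak at a realistic length.

```python
def geometric_duration_pmf(self_prob, max_len):
    """Duration pmf implied by an HMM state with self-transition probability self_prob.

    Staying exactly d steps means d - 1 self-transitions followed by one exit.
    The pmf is truncated at max_len.
    """
    return [self_prob ** (d - 1) * (1.0 - self_prob) for d in range(1, max_len + 1)]


def explicit_duration_pmf(weights):
    """Normalise non-negative weights into an explicit duration pmf,
    of the kind a semi-Markov state could carry (weights are made up)."""
    total = float(sum(weights))
    return [w / total for w in weights]


def mode(pmf):
    """Most probable duration, counted from 1."""
    return max(range(len(pmf)), key=lambda i: pmf[i]) + 1


def mean(pmf):
    """Expected duration under the (possibly truncated) pmf."""
    return sum(d * p for d, p in enumerate(pmf, start=1))


# Hypothetical numbers, for illustration only.
hmm_pmf = geometric_duration_pmf(self_prob=0.9, max_len=200)
hsmm_pmf = explicit_duration_pmf([1, 1, 2, 4, 8, 12, 8, 4, 2, 1])

print("HMM most likely duration:", mode(hmm_pmf))    # 1, whatever self_prob is
print("HSMM most likely duration:", mode(hsmm_pmf))  # 6, set by the weights
print("HMM mean duration:", mean(hmm_pmf))
```

With a geometric duration the only tunable quantity is the mean, 1/(1-p); the shape is fixed and always decreasing. That is why a model that needs, for example, a preferred spacing between motifs may benefit from an explicit duration distribution.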
Keywords/Search Tags: cis-regulatory module discovery, motif, gene transcriptional regulation, HMM