
Study On Motif Discovery Algorithms For High-throughput Sequencing Datasets Based On Expectation Maximization

Posted on: 2019-05-11
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
GTID: 2428330572951574
Subject: Engineering
Abstract/Summary:
DNA motif discovery aims to find a set of similar sequence fragments in a given set of DNA sequences, which helps to locate regulatory elements such as transcription factor binding sites. Transcription factors bind to specific sites upstream of a gene to control transcription initiation and regulate the transcription rate; these specific sites are called transcription factor binding sites. The study of motif discovery algorithms therefore plays an important role in revealing the mechanisms of transcriptional regulation. In recent years, with the rapid development of high-throughput sequencing technologies, ChIP-seq and related technologies can identify transcription factor binding sites at the genome level, providing a large volume of experimental data for motif discovery.

Expectation maximization (EM) is widely used to solve motif discovery problems. On small datasets, EM-based motif discovery algorithms can usually identify motifs efficiently and effectively, but the large datasets generated by high-throughput sequencing pose a challenge: processing the entire dataset requires a huge amount of computation time, whereas processing only a small sample of the sequences may fail to identify motifs with a low occurrence frequency. To address this, the thesis carries out two parts of work on EM-based motif discovery algorithms for high-throughput sequencing datasets.

The first part proposes the MDS~3 algorithm, which divides the input into sample sets and solves each sample set separately. It first divides the input sequence set into multiple sample sequence sets, then refines the initial motifs in each sample sequence set with the EM algorithm, and finally combines the results from all sample sequence sets. The initial motifs for each sample sequence set are generated by a method that uses the entire input sequence set, which helps to identify motifs with a low occurrence frequency. The experimental results show that MDS~3 achieves identification accuracy comparable to that of existing algorithms (MEME-ChIP, F-Motif, and PairMotifChIP) with better time performance, especially on large datasets; in particular, when the motif occurs infrequently in the dataset, MDS~3 outperforms the compared algorithms in both identification accuracy and time performance.

The second part designs OMD, an online motif discovery algorithm based on online EM. It continually reads data blocks from the given input sequence set; for each data block, it uses the information from the previous data block to solve the current block; finally, a post-processing step is executed. When handling each data block, the closed solution (solved without information from historical data blocks) and the online solution (solved with information from historical data blocks) are combined, which effectively avoids over-dependence on the new block. The results show that the identification accuracy of OMD is higher than that of the existing online motif discovery algorithm (EXTREME), and that OMD can effectively identify motifs with a low occurrence frequency as well as motifs that are distributed unevenly across the dataset.
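
The divide, refine, and combine strategy described for MDS~3 can be illustrated with a small sketch. This is not the thesis's implementation. It assumes exactly one motif occurrence per sequence, a uniform background, a hand-supplied initial motif (the thesis derives initial motifs from the entire input set), and simple averaging to combine the per-sample results; the abstract specifies none of these details. All function names (`em_refine`, `pwm_from_seed`, `split_and_solve`, `plant`, `consensus`) are hypothetical.

```python
import random

ALPHABET = "ACGT"

def pwm_from_seed(seed, strength=0.7):
    """Turn an initial motif string into a position weight matrix (PWM)."""
    rest = (1.0 - strength) / 3
    return [{b: (strength if b == c else rest) for b in ALPHABET} for c in seed]

def em_refine(seqs, pwm, iters=20, pseudo=0.1):
    """Refine a PWM with EM. Assumes one occurrence per sequence and a
    uniform background (illustrative assumptions, not from the thesis)."""
    w = len(pwm)
    for _ in range(iters):
        counts = [{b: pseudo for b in ALPHABET} for _ in range(w)]
        for s in seqs:
            # E-step: likelihood ratio of motif vs. background at each start.
            scores = []
            for i in range(len(s) - w + 1):
                p = 1.0
                for j in range(w):
                    p *= pwm[j][s[i + j]] / 0.25
                scores.append(p)
            total = sum(scores)
            # Accumulate expected base counts, weighted by the posterior.
            for i, p in enumerate(scores):
                for j in range(w):
                    counts[j][s[i + j]] += p / total
        # M-step: renormalize the expected counts into a new PWM.
        pwm = [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]
    return pwm

def split_and_solve(seqs, seed, n_samples=3):
    """Split the input into sample sets, refine each with EM, then average."""
    samples = [seqs[k::n_samples] for k in range(n_samples)]
    pwms = [em_refine(s, pwm_from_seed(seed)) for s in samples if s]
    return [{b: sum(p[j][b] for p in pwms) / len(pwms) for b in ALPHABET}
            for j in range(len(seed))]

def consensus(pwm):
    """Most probable base in each motif column."""
    return "".join(max(col, key=col.get) for col in pwm)

def plant(motif, n, length):
    """Synthetic data: random sequences with one planted motif copy each."""
    seqs = []
    for _ in range(n):
        bases = [random.choice(ALPHABET) for _ in range(length)]
        pos = random.randrange(length - len(motif) + 1)
        bases[pos:pos + len(motif)] = motif
        seqs.append("".join(bases))
    return seqs

random.seed(0)
data = plant("TACGTGCA", n=60, length=40)
# The seed differs from the planted motif in its last position.
print(consensus(split_and_solve(data, seed="TACGTGCT")))
```

Each sample set is refined independently, so the per-sample EM runs could be executed in parallel. Averaging the PWMs is only one possible way to combine results and is used here for brevity.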
Keywords/Search Tags: motif discovery, expectation maximization, high-throughput sequencing datasets, transcription factor binding sites