Study On Algorithms For Discovering Motifs In Large DNA Datasets Based On Word Count

Posted on:2019-05-20

Degree:Master

Type:Thesis

Country:China

Candidate:D B Wei

Full Text:PDF

GTID:2428330572451515

Subject:Engineering

Abstract/Summary:

Motif discovery is the task of finding conserved patterns in a given set of DNA sequences. It is mainly used to locate transcription factor binding sites (TFBSs) and plays an important role in studying gene expression and regulation. Moreover, in higher eukaryotes, gene expression is often regulated by cooperating transcription factors, and recognition of the corresponding binding sites can be abstracted as structured motif discovery. Next-generation sequencing (NGS) makes it possible to locate transcription factor binding sites at the genome level. However, the resulting DNA datasets are much larger than traditional promoter sequence datasets, which brings new challenges for solving motif discovery. Motif discovery can be formally defined as the quorum planted (l, d) motif search (qPMS) problem. A structured motif consists of two or more (l, d) motifs separated by variable gaps. Compared with traditional small datasets, large DNA datasets contain more motif occurrences. Because DNA motifs are conserved, their occurrences are similar to each other, so substrings that occur with high frequency in large datasets are likely to be motif occurrences. Based on this observation, two studies on word-count-based motif discovery algorithms are carried out.

The first study focuses on accelerating existing qPMS algorithms through sample sequence selection. First, the effects of the number of input sequences t and the ratio q of sequences containing motif occurrences on the running time of qPMS algorithms are analyzed; the analysis shows that a large t or a small q leads to a longer computation time. Thus, to improve the time performance of existing qPMS algorithms, sample sequence sets with a small t and a large q can be selected from the large input datasets. Based on this consideration, a sample sequence selection algorithm named SamSelect is proposed. SamSelect uses word count to obtain high-frequency substrings from the input sequences and obtains sample sequence sets by clustering the high-frequency substrings. Experimental results on both simulated and real data show that SamSelect can select sample sequence sets in a short time, and that qPMS algorithms executed on the sample sequence sets find implanted or real motifs in significantly less time than when executed on the original sequence sets.

The second study focuses on structured motif discovery in large DNA datasets. Each single motif in a structured motif, and its fragments, may also occur many times in large datasets, so mining high-frequency substrings and processing them with the structured motif template can search for structured motifs efficiently and effectively. Based on this consideration, a structured motif discovery algorithm named SMS is proposed. First, the algorithm adaptively selects w and k according to the l and d of each single motif in the given structured motif template, and computes the k-mismatch count of all w-mers in the input sequences. Second, following the structured motif template, a sliding window scans the input sequences to obtain peak substrings, which are expected to cover structured motif instances. Finally, the structured motif is obtained by aligning the peak substrings. Comparisons with existing algorithms on multiple datasets show that SMS finds structured motifs faster while maintaining comparable prediction accuracy.
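
The basic notions used in the abstract (word count, k-mismatch count of w-mers, and the quorum condition of the qPMS problem) can be illustrated with a minimal Python sketch. This is not the SamSelect or SMS implementation from the thesis: the function names are hypothetical, the k-mismatch count is interpreted here as the summed count of all observed w-mers within Hamming distance k, and the brute-force loops are only suitable for toy inputs.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_words(seqs: list[str], w: int) -> Counter:
    """Exact word count: occurrences of every w-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts


def k_mismatch_counts(seqs: list[str], w: int, k: int) -> dict[str, int]:
    """Assumed definition: for each observed w-mer u, sum the exact counts
    of all observed w-mers within Hamming distance k of u (quadratic scan)."""
    exact = count_words(seqs, w)
    return {
        u: sum(c for v, c in exact.items() if hamming(u, v) <= k)
        for u in exact
    }


def satisfies_quorum(motif: str, seqs: list[str], d: int, q: float) -> bool:
    """qPMS condition: at least a fraction q of the sequences contain an
    l-mer within Hamming distance d of the candidate motif (l = len(motif))."""
    l = len(motif)
    hits = sum(
        any(hamming(motif, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in seqs
    )
    return hits >= q * len(seqs)


if __name__ == "__main__":
    toy = ["ACGTACGT", "TTACGATT", "GGGGGGGG"]
    print(count_words(toy, 4).most_common(3))
    print(k_mismatch_counts(toy, 4, 1)["ACGT"])
    print(satisfies_quorum("ACGT", toy, d=1, q=0.5))
```

In this sketch, high-frequency w-mers from `count_words` or `k_mismatch_counts` play the role of candidate motif occurrences, while `satisfies_quorum` checks whether a candidate is an (l, d) motif for a given q.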

Keywords/Search Tags:

motif discovery, large DNA datasets, word count, transcription factor binding sites

Related items

1	Study On Motif Discovery Algorithms For High-throughput Sequencing Datasets Based On Expectation Maximization
2	An Approach For Recognition Of Transcription Factor Binding Sites Based On Genetic Algorithm
3	Research On Fast Motif Finding Methods Based On Heuristic Strategies
4	Efficient Large-Scale Machine Learning Algorithms for Genomic Sequence
5	Study On Clustering Of Position Frequency Matrices For Transcription Factor Binding Site
6	Die Body Similarity Comparison Algorithm Research
7	The Study Of Characterization And Prediction Of Binding Sites On Proteins Based On Machine Learning Methods
8	Multi-class Learning For Sequential Data
9	Applications Of Machine Learning Approaches To Biological Sequence Analysis
10	Study On Algorithms For DNA Sequence Motif Discovery Based On Gibbs Sampling