The Study Of Reference Sequence Selection For Motif Searches

Posted on:2016-01-11

Degree:Master

Type:Thesis

Country:China

Candidate:R X Zhao

Full Text:PDF

GTID:2348330488974113

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

In a set of biological sequences, there always exists some special short sequence fragments that have significant biological functions, i.e. the transcription factor binding sites in DNA sequence set. These sequence fragments are similar but not identical to each other, which are called motifs. Researchers usually use computational method to identify motifs in the biological sequences. This computational process is what the motif search problem we study in this paper. The motif search problem is one of the most important and challenging problems in both bioinformatics and computer science.There has been lots of algorithms to solve the planted(l, d) motif search(PMS) problem since it was first proposed by Keich and Pevzne. Of all the PMS algorithms, we mainly focus on the exact ones in this paper. They can find all the motifs through traversing the whole search space. The main indicator to assess exact algorithms is time performance. Researchers usually compare exact algorithms on the challenging PMS problem instances. The pattern-driven PMS algorithms have the best time performance in identifying linear motifs or long weak motifs in all of the exact PMS algorithms. Their basic idea is using k out of the t input sequences as reference sequences to generate candidate motifs, and then verifying each of the candidate motifs. In this way, they can identify all the motifs of the input sequences.However, most of the pattern-driven PMS algorithms fixedly regard the first k sequences in the input as reference sequences without elaborate selection processes. We find in our research that different reference sequences may lead to quite different number of candidate motifs, especially for large alphabets. So, in dealing with different inputs with the same scale, the pattern-driven PMS algorithms may exhibit sharp fluctuations in running time.In this paper, we build the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs for the reference sequences. RefSelct make the reference sequences we choose correspond to a small number of candidate motifs. Experimental results show that RefSelect(1) makes the tested algorithms qPMS7, TraverStrR and PMS8 solve the PMS problem steadily in an efficient way,(2) particularly, makes them achieve a speedup of up to about 100� on the protein data, and(3) is also suitable for large data sets which contain hundreds or more sequences.

Keywords/Search Tags:

planted(l,d) motif search problem, pattern-driven, reference sequences

PDF Full Text Request

Related items

1	Research On Double-block Planted Motif Finding Algorithm Based On Frequent Pattern Mining
2	Subtle Motif Discovery Algorithms
3	Research On The Algorithms And Applications For The Motif Discovery Problem
4	Parallel sorting and motif search
5	Study On Algorithms For DNA Sequence Motif Discovery Based On Gibbs Sampling
6	Behavior Analysis And Probability Control Strategy Of Local Search Algorithm For Solving SAT Problem
7	Evolutionary Algorithm For Motif Discovery
8	Algorithms On Motif Finding And Closest String Problems
9	Study On Sequential Motif Discovery Algorithms Based On Subgraph Density
10	The Research On The Heuristic Search Algorithms For Solving The Combinatorial Optimization Problems