Font Size: a A A

The Study Of Reference Sequence Selection For Motif Searches

Posted on:2016-01-11Degree:MasterType:Thesis
Country:ChinaCandidate:R X ZhaoFull Text:PDF
GTID:2348330488974113Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In a set of biological sequences, there always exists some special short sequence fragments that have significant biological functions, i.e. the transcription factor binding sites in DNA sequence set. These sequence fragments are similar but not identical to each other, which are called motifs. Researchers usually use computational method to identify motifs in the biological sequences. This computational process is what the motif search problem we study in this paper. The motif search problem is one of the most important and challenging problems in both bioinformatics and computer science.There has been lots of algorithms to solve the planted(l, d) motif search(PMS) problem since it was first proposed by Keich and Pevzne. Of all the PMS algorithms, we mainly focus on the exact ones in this paper. They can find all the motifs through traversing the whole search space. The main indicator to assess exact algorithms is time performance. Researchers usually compare exact algorithms on the challenging PMS problem instances. The pattern-driven PMS algorithms have the best time performance in identifying linear motifs or long weak motifs in all of the exact PMS algorithms. Their basic idea is using k out of the t input sequences as reference sequences to generate candidate motifs, and then verifying each of the candidate motifs. In this way, they can identify all the motifs of the input sequences.However, most of the pattern-driven PMS algorithms fixedly regard the first k sequences in the input as reference sequences without elaborate selection processes. We find in our research that different reference sequences may lead to quite different number of candidate motifs, especially for large alphabets. So, in dealing with different inputs with the same scale, the pattern-driven PMS algorithms may exhibit sharp fluctuations in running time.In this paper, we build the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs for the reference sequences. RefSelct make the reference sequences we choose correspond to a small number of candidate motifs. Experimental results show that RefSelect(1) makes the tested algorithms qPMS7, TraverStrR and PMS8 solve the PMS problem steadily in an efficient way,(2) particularly, makes them achieve a speedup of up to about 100× on the protein data, and(3) is also suitable for large data sets which contain hundreds or more sequences.
Keywords/Search Tags:planted(l,d) motif search problem, pattern-driven, reference sequences
PDF Full Text Request
Related items