Font Size: a A A

All-pair comparison of billion-base genome sequences

Posted on:2014-01-30Degree:M.SType:Thesis
University:Rensselaer Polytechnic InstituteCandidate:Li, HaiqiongFull Text:PDF
GTID:2450390005495349Subject:Computer Science
Abstract/Summary:
The LSH-ALL-PAIRS algorithm is an important method for comparison of genomic DNA sequences in order to find conserved genome features (i.e., subsequences of genomic sequence with d bases, called d-mers) across different species (e.g., Escherichia coli and Mycobacterium tuberculosis ). The algorithm is based on a randomized search technique called locality-sensitive hashing (LSH), which is first applied to all the d-mers. After the hashing step, d-mers with the same hash value are grouped together into a class and pair-wise comparison is performed to find similar d-mers up to a certain number of mismatched bases. Instead of performing pair-wise comparison on potentially very large number of the input d-mers, LSH-ALL-PAIRS algorithm divides the d-mers into classes and performs pair-wise comparison on each individual group. However, since the computational complexity for pair-wise comparison within each class is still O(N2), where N is the number of d-mers in each class, the sequential LSH-ALL-PAIRS algorithm cannot process very long genomic sequence with billions of bases. In this thesis, a parallel implementation of the LSH-ALL-PAIRS algorithm is proposed. The PARALLEL-LSH-ALL-PAIRS algorithm is based on the message passing interface (MPI) and runs on high performance servers, computing clusters, and supercomputers. The PARALLEL-LSH-ALL-PAIRS algorithm first uses a pool of processes to evaluate the hashing function on billions of genomic subsequences in parallel. Then, d-mers with the same hash value are grouped together and redistributed among all the processes using MPI communication. Finally, each process performs pair-wise comparisons of the assigned subsequences and outputs groups of similar pairs. Experiments show that the PARALLEL-LSH-ALL-PAIRS algorithm achieves good scalability with an increasing number of cores and increasing sizes of the input data on the RPI's IBM Blue Gene/Q supercomputer.
Keywords/Search Tags:LSH-ALL-PAIRS algorithm, Comparison, D-mers, Genomic
Related items