High performance computational biology algorithms

Posted on:2011-08-05

Degree:Ph.D

Type:Dissertation

University:University of Illinois at Chicago

Candidate:Saeed, Fahad


GTID:1448390002466758

Subject:Biology

Abstract/Summary:

Multiple Sequence Alignment (MSA) of biological sequences is a fundamental problem in computational biology because of its importance in a wide range of applications, including haplotype reconstruction, sequence homology, phylogenetic analysis, and prediction of evolutionary origins. The MSA problem is considered NP-hard, and known heuristics for the problem do not scale well with an increasing number of sequences. On the other hand, with the advent of a new breed of fast sequencing techniques, it is now possible to generate thousands of sequences very quickly. For rapid sequence analysis, it is therefore desirable to develop fast MSA algorithms that scale well with increasing dataset size. In this dissertation, we propose a novel domain decomposition based technique to solve the multiple sequence alignment problem on multiprocessing platforms. The domain decomposition based technique, in addition to yielding better quality, gives an enormous advantage in terms of execution time and memory requirements. The proposed strategy decreases the time complexity of any known heuristic of O(N)^x complexity by a factor of O(1/p)^x, where N is the number of sequences, x depends on the underlying heuristic approach, and p is the number of processing nodes. In particular, we propose a highly scalable algorithm, Sample-Align-D, for aligning biological sequences using the Muscle system as the underlying heuristic. We also develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align a large number of reads from single or multiple reference genomes obtained from a pyrosequencing procedure. The proposed alignment algorithm accurately aligns the erroneous reads in a short period of time. The proposed algorithms have been implemented on a cluster of workstations using the MPI library.
We report high-quality multiple alignment of up to 0.5 million reads, with our analysis suggesting that 10 million or more reads can be aligned using our parallel algorithm. The algorithms are shown to be highly scalable and exhibit super-linear speedups with an increasing number of processors.
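
The complexity claim in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the dissertation: it assumes an idealized cost model in which the underlying heuristic takes time proportional to N^x, the N sequences are split evenly over p nodes, and communication and the merging of sub-alignments cost nothing. The function names `per_node_cost` and `ideal_speedup` are hypothetical. Under those assumptions each node does (N/p)^x work, so the ideal speedup is p^x, which exceeds p whenever x > 1 and is one way to read the super-linear speedups reported above.

```python
def per_node_cost(n_sequences: int, n_nodes: int, x: float) -> float:
    """Idealized work done by one node: (N / p)^x.

    Assumes an even split of sequences and an N^x heuristic (both assumptions).
    """
    return (n_sequences / n_nodes) ** x


def ideal_speedup(n_sequences: int, n_nodes: int, x: float) -> float:
    """Serial cost N^x divided by the per-node cost; equals p^x in this model."""
    serial_cost = float(n_sequences) ** x
    return serial_cost / per_node_cost(n_sequences, n_nodes, x)


if __name__ == "__main__":
    # Hypothetical values: 10,000 sequences and a quadratic heuristic (x = 2).
    for p in (2, 4, 8, 16):
        print(f"p = {p:2d}  ideal speedup = {ideal_speedup(10_000, p, 2.0):.1f}")
```

Real runs will fall short of this bound because data redistribution, load imbalance, and the final merge step all add cost that the model ignores.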

Keywords/Search Tags:

Algorithm, Highly scalable, MSA, Sequences, Problem, Alignment

Related items

1	Probabilistic computational methods for structural alignment of RNA sequences
2	Reconstructing Truncated Sequences Derived From Primitive Sequences Over Integer Residue Rings
3	Multiple Sequences Alignment Based On A-Star And DiAlign Algorithms
4	Research And Application Of Ant Colony Optimization Algorithm For Maximum Clique Problem
5	Parameter advising for multiple sequence alignment
6	Biological Sequence Alignment Problem
7	A block-based scalable motion model for highly scalable video coding
8	Towards highly reliable and scalable distributed systems
9	Multiple Structural Alignment Of RNA Sequences Based On Stem Fragments
10	The Algorithms Research Of An Consensus Path To Multiple Alignment For DNA Sequences