Font Size: a A A

High Throughput Sequence Mapping for Next Generation DNA Sequencing

Posted on:2012-05-18Degree:Ph.DType:Dissertation
University:Northwestern UniversityCandidate:Misra, SanchitFull Text:PDF
GTID:1460390011464064Subject:Engineering
Abstract/Summary:
Motivation. Next generation sequencing machines are generating millions of DNA sequences every-day with applications in Denovo sequencing, resequencing, ChIP-Seq, RNA-Seq, protein-protein interactions, metagenomics, etc. Mapping the reads generated by sequencers to a reference genomic sequence is an integral step in all of these applications. Depending on the underlying technology, the next generation sequencers produce short (30--100bp) or long (250 -- 1000bp) reads. Recently a number of programs have been proposed for mapping short reads to a reference genome. For long read sequence mapping, there are limited options; BLAT, SSAHA2 and BWA-SW are among the popular ones. However, many sequencers are already generating longer reads and more are expected to follow. Moreover, lack of ground truth in the form of real reads and their exact matching locations in the reference sequence makes it difficult to compare new sequence mapping algorithms with the existing ones. This dissertation aims at addressing these issues.;Results. We have developed FANGS (Fast Algorithm for Next Generation Sequencers) that can handle up to 1% errors in alignment and further improved it to develop AGILE (AliGnIng Long rEads) that can handle up to 10% errors in alignment. AGILE is a hash table based high throughput sequence mapping algorithm for longer 454 reads that uses diagonal multiple seed-match criteria, customized q-gram filtering and a dynamic incremental search approach among other heuristics to optimize every step of the mapping process. In our experiments, we observe that AGILE is more accurate than BLAT, and comparable to BWA-SW and SSAHA2. For practical error rates (< 5%) and read lengths (200 -- 1000bp), AGILE is significantly faster than BLAT, SSAHA2, and BWA-SW. Even for the other cases, AGILE is comparable to BWA-SW and several times faster than BLAT and SSAHA2. AGILE can map up to 630 million nucleotides to the human genome per hour. We have also developed scalable distributed memory implementations of FANGS and AGILE, called pFANGS and PAR-AGILE respectively. Using 512 processors, pFANGS and PAR-AGILE can map up to 15:5 billion nucleotides per hour and up to 150 billion nucleotides per hour respectively. In addition, we have developed a gold standard dataset consisting of real reads and all of their significant matches in the human genome. This dataset can be used to perform comparative studies on the prominent long read sequence mapping tools. Moreover, we have studied the energy consumption of GPU applications and specifically studied the effects of various optimization techniques over the energy consumption of cudasw++, an implementation of Smith-Waterman on GPU.
Keywords/Search Tags:Sequence, Generation, AGILE, BWA-SW, SSAHA2, BLAT, Reads
Related items