
Algorithms for next-generation sequencing

Posted on: 2009-10-17
Degree: Ph.D.
Type: Dissertation
University: Stanford University
Candidate: Sundquist, Andreas
Full Text: PDF
GTID: 1444390002494132
Subject: Biology
Abstract/Summary:
Thirty years ago, Sanger first introduced the gel electrophoresis method for sequencing DNA. Since then, technology has improved to the point where we are adding over 10 billion bases to GenBank each year at a cost of less than 0.1 cents per base. Amazingly, we are now entering an era of even more dramatic sequencing growth thanks to next-generation technologies that will completely dwarf all previous efforts. Although the cost and speed of sequencing will improve by orders of magnitude, the characteristic short read length of such technologies creates new challenges in effectively using the data. In this dissertation, I describe three significant algorithmic contributions I have made for next-generation sequencing: (1) whole-genome short-read sequencing and assembly, (2) bacterial flora-typing using targeted short-read sequencing, and (3) ancestry inference using dense SNP arrays.

First, I present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that uses short-read technologies to decipher complex, mammalian-sized genomes. Our sequencing protocol is based on a variation of hierarchical clone-based sequencing that is optimized for high-throughput implementation using current technologies. We assemble the genome through a series of algorithms that first determine clone ordering in silico, then perform error correction, and finally assemble localized sets of reads in three stages of hierarchically larger regions. By benchmarking our method on large simulations of the human genome, I demonstrate that it is possible to perform fast and truly inexpensive de novo sequencing of mammalian genomes.

Next, I describe the genomic study of microbial communities using next-generation sequencing. I present a methodology for phylogenetic classification based on short 16S rDNA gene sequence reads and apply the technique to reads obtained via high-throughput pyrosequencing. I then examine our ability to classify reads at different levels in the phylogeny and, using simulation, discuss the limitations of the technique and the effects of read length and of targeting specific 16S variable regions.

Finally, I present HAPAA (HMM-based Analysis of Polymorphisms in Admixed Ancestries), a methodology for inferring the ancestry of chromosomal blocks using dense SNP arrays. I describe how our method improves upon previous techniques by modeling the long-range patterns of haplotypic variation seen in populations due to linkage disequilibrium. To study the effect of genetic divergence between populations on ancestry inference methods, I also present a testing methodology we devised that constructs synthetic populations and tests on individuals with varied genetic histories.

As DNA sequencing technology evolves, it will continue to open up opportunities for new computational approaches for understanding our genetics. The algorithms I present address three such opportunities that exist today with next-generation sequencing.
Keywords/Search Tags: Sequencing, Algorithms