Algorithms for next-generation sequencing

Posted on:2009-10-17

Degree:Ph.D

Type:Dissertation

University:Stanford University

Candidate:Sundquist, Andreas

GTID:1444390002494132

Subject:Biology

Abstract/Summary:

Thirty years ago, Sanger first introduced the gel electrophoresis method for sequencing DNA. Since then, technology has improved to the point where we are adding over 10 billion bases to GenBank each year at a cost of less than 0.1 cents per base. Amazingly, we are now entering an era of even more dramatic sequencing growth thanks to next-generation technologies that will completely dwarf all previous efforts. Although the cost and speed of sequencing will improve by orders of magnitude, the characteristic short read length of such technologies creates new challenges in effectively using the data. In this dissertation, I describe three significant algorithmic contributions I have made for next-generation sequencing: (1) whole-genome short-read sequencing and assembly, (2) bacterial flora-typing using targeted short-read sequencing, and (3) ancestry inference using dense SNP arrays.

First, I present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that uses short-read technologies to decipher complex, mammalian-sized genomes. Our sequencing protocol is based on a variation of hierarchical clone-based sequencing that is optimized for high-throughput implementation using current technologies. We assemble the genome through a series of algorithms that first determine clone ordering in silico, then perform error correction, and finally assemble localized sets of reads in three stages of hierarchically larger regions. By benchmarking our method on large simulations of the human genome, I demonstrate that it is possible to perform fast and truly inexpensive de novo sequencing of mammalian genomes.

Next, I describe the genomic study of microbial communities using next-generation sequencing. I present a methodology for phylogenetic classification based on short 16S rDNA gene sequence reads and apply the technique to reads obtained via high-throughput pyrosequencing. I then examine our ability to classify reads at different levels of the phylogeny and, using simulation, discuss the limitations of the technique and the effects of read length and of targeting specific 16S variable regions.

Finally, I present HAPAA (HMM-based Analysis of Polymorphisms in Admixed Ancestries), a methodology for inferring the ancestry of chromosomal blocks using dense SNP arrays. I describe how our method improves upon previous techniques by modeling the long-range patterns of haplotypic variation seen in populations due to linkage disequilibrium. To study the effect of genetic divergence between populations on ancestry inference methods, I also present a testing methodology we devised that constructs synthetic populations and tests on individuals with varied genetic histories.

As DNA sequencing technology evolves, it will continue to open up opportunities for new computational approaches for understanding our genetics. The algorithms I present address three such opportunities that exist today with next-generation sequencing.
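
As a rough illustration of the general idea behind HMM-based ancestry inference mentioned in the abstract, the sketch below labels each SNP along one haplotype with its most likely ancestral population using Viterbi decoding. It is not HAPAA: it treats SNPs as independent given ancestry and therefore omits the linkage-disequilibrium modeling the abstract describes. The function name `viterbi_ancestry`, the `switch_prob` parameter, and the allele frequencies are all hypothetical, chosen only for this example.

```python
import math


def viterbi_ancestry(alleles, freqs, switch_prob=0.01):
    """Return the most likely ancestral population at each SNP.

    Toy model, NOT the HAPAA method: SNPs are independent given ancestry.
    alleles     -- list of observed alleles (0 or 1) along one haplotype
    freqs       -- dict: population -> list of P(allele == 1) per SNP,
                   each strictly between 0 and 1 (assumed values)
    switch_prob -- assumed chance that ancestry changes between adjacent SNPs
    Requires at least two populations.
    """
    pops = list(freqs)
    if not alleles:
        return []

    def emit(pop, i):
        # Log-probability of the observed allele under this population.
        p = freqs[pop][i]
        return math.log(p if alleles[i] == 1 else 1.0 - p)

    stay = math.log(1.0 - switch_prob)
    move = math.log(switch_prob / (len(pops) - 1))

    def trans(prev, cur):
        return stay if prev == cur else move

    # Uniform prior over populations at the first SNP.
    score = {p: math.log(1.0 / len(pops)) + emit(p, 0) for p in pops}
    back = []
    for i in range(1, len(alleles)):
        new_score, pointers = {}, {}
        for cur in pops:
            best = max(pops, key=lambda prev: score[prev] + trans(prev, cur))
            new_score[cur] = score[best] + trans(best, cur) + emit(cur, i)
            pointers[cur] = best
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(pops, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    observed = [1, 1, 1, 1, 0, 0, 0, 0]
    allele_freqs = {"A": [0.9] * 8, "B": [0.1] * 8}
    print(viterbi_ancestry(observed, allele_freqs))
```

With these made-up frequencies, the decoded path splits the haplotype into one block per population, which is the kind of chromosomal-block labeling the abstract refers to. A method that accounts for linkage disequilibrium would need richer hidden states than a single population label per SNP.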

Keywords/Search Tags:

Sequencing, Algorithms

Related items

1	Algorithms for next-generation sequencing
2	Approximation algorithms for sequencing problems
3	Study On Detection Algorithms For Tumor Genomic Copy Number Alterations Based On Next-Generation Sequencing
4	Efficient Algorithms for Human Genetic Variation Detection using High-throughput Sequencing Techniques
5	Algorithms for high-resolution positron emission tomography
6	Efficient Algorithms for Detecting Genetic Interactions in Genome-Wide Association Study
7	Algorithms for Determining Differentially Expressed Genes and Chromosome Structures From High-Throughput Sequencing Data
8	Algorithms for inverting Hodgkin-Huxley type neuron models
9	Study On The Molecular Mechanism Of Fanconi Anemia By Whole Exome Sequencing
10	OOPSI: A family of optimal optical spike inference algorithms for inferring neural connectivity from population calcium imaging