Font Size: a A A

DNA Contig Generation Algorithm Based On De Bruijn Graph

Posted on:2012-06-14Degree:MasterType:Thesis
Country:ChinaCandidate:X WangFull Text:PDF
GTID:2210330362451438Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In order to explore the essence of life, people are eager to get all the bases on DNA molecules of new species quickly. Now, a new generation of sequencing technology is developing on and on, but in the process of sequencing, DNA has been broken into fragments. So de novo sequencing is proposed. As the second generation, third generation sequencing technology generated, people can get a lot of sequence data in a relatively short period of time. The sequencing technology is developing in the direction of high-throughput, low cost and high-precision, which leads to more and more sequencing data are accumulated. How to quickly and accurately handle the massive sequencing data has become the bottleneck of the development of DNA sequencing.Before sequencing, DNA must be copied, smashed and filtered, and then the sequencer read DNA fragments out. These DNA fragments are called reads. First, we must generate contigs from reads. A contig is such a DNA fragment that is ovoerlaped by reads, and each base is covered by a read. In this paper, we propose a new de novo sequencing algorithm, called SRGA. This algorithm is based on de Bruijn graph, and it transforms the de novo sequencing problem into a quadtree search problem. In this algorithm heuristic search strategy is used, so it is able to deal with mass sequencing data quickly and the contigs generated from the algorithm have high quality.In this paper, the principle and the realization of the algorithm is described detailedly. In order to save a great deal of reads, a new de Bruijn graph is used. Decision table is essential for heuristic rules. Assembling reads are saved in the decision table, and these reads decide which k-mer is the subsequent k-mer. The decision table needs to be update when a subsequent k-mer is chosen. We choose subsequent k-mer repeatedly to extend contig. So choosing the subsequent k-mer is an important process. If and only if the right subsequent k-mer is chosen can we get the contig with high quality. Here locked read strategy is proposed. Some reads which are about success can be locked through set the locked bit in the decision table and then the locked reads will decide which k-mer is the subsequent k-mer. This ensures that each base on the contig can be covered by a successful assembling read.At last, we compare SRGA with EULER. EULER produces shortet contigs, and each contig can match reference genome well, but overall coverage is low. SRGA produces longer contigs, and uses less time and memory. Although each contig can not match reference genome well, yet overall coverage is high.
Keywords/Search Tags:de novo sequencing, heuristic search, de Bruijn graph, decision table
PDF Full Text Request
Related items