DNA Contig Generation Algorithm Based On De Bruijn Graph

Posted on:2012-06-14

Degree:Master

Type:Thesis

Country:China

Candidate:X Wang

Full Text:PDF

GTID:2210330362451438

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

In order to explore the essence of life, people are eager to get all the bases on DNA molecules of new species quickly. Now, a new generation of sequencing technology is developing on and on, but in the process of sequencing, DNA has been broken into fragments. So de novo sequencing is proposed. As the second generation, third generation sequencing technology generated, people can get a lot of sequence data in a relatively short period of time. The sequencing technology is developing in the direction of high-throughput, low cost and high-precision, which leads to more and more sequencing data are accumulated. How to quickly and accurately handle the massive sequencing data has become the bottleneck of the development of DNA sequencing.Before sequencing, DNA must be copied, smashed and filtered, and then the sequencer read DNA fragments out. These DNA fragments are called reads. First, we must generate contigs from reads. A contig is such a DNA fragment that is ovoerlaped by reads, and each base is covered by a read. In this paper, we propose a new de novo sequencing algorithm, called SRGA. This algorithm is based on de Bruijn graph, and it transforms the de novo sequencing problem into a quadtree search problem. In this algorithm heuristic search strategy is used, so it is able to deal with mass sequencing data quickly and the contigs generated from the algorithm have high quality.In this paper, the principle and the realization of the algorithm is described detailedly. In order to save a great deal of reads, a new de Bruijn graph is used. Decision table is essential for heuristic rules. Assembling reads are saved in the decision table, and these reads decide which k-mer is the subsequent k-mer. The decision table needs to be update when a subsequent k-mer is chosen. We choose subsequent k-mer repeatedly to extend contig. So choosing the subsequent k-mer is an important process. If and only if the right subsequent k-mer is chosen can we get the contig with high quality. Here locked read strategy is proposed. Some reads which are about success can be locked through set the locked bit in the decision table and then the locked reads will decide which k-mer is the subsequent k-mer. This ensures that each base on the contig can be covered by a successful assembling read.At last, we compare SRGA with EULER. EULER produces shortet contigs, and each contig can match reference genome well, but overall coverage is low. SRGA produces longer contigs, and uses less time and memory. Although each contig can not match reference genome well, yet overall coverage is high.

Keywords/Search Tags:

de novo sequencing, heuristic search, de Bruijn graph, decision table

PDF Full Text Request

Related items

1	A Parallel And Optimized Algorithms For De Novo Short Read Assembly Using De Bruijn Graphs
2	Scaling short read de novo DNA sequence assembly to gigabase genomes
3	The Implementation Of Metagenome Sequencing Assembly Based On De Bruijn Graph Algorithm
4	Research On Genomic Reads Mapping Based On De Bruijn Graph Model
5	RNA Spatial Structure Prediction Based On Heuristic Search Strategy
6	Study On Peptide Sequencing Method Based On Graph Neural Networks
7	The Research Of Parallel Processing Techniques Of Proteomic Spectra Big Data
8	De novo peptide sequencing using a hidden Markov model
9	Research On Optimizations For Graph Edit Distance
10	De Novo Sequencing Using MS/MS Data From Ion Trap