Genome Assembly Algorithm Based On Next-Generation Sequencing

Posted on:2016-12-15

Degree:Master

Type:Thesis

Country:China

Candidate:K Zhao

Full Text:PDF

GTID:2180330461486334

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

De novo assembly is a basic subject of Bioinformatics. It’s a difficult but very significative research topic. De novo assembly is such technology that using computer technology to rebuild DNA sequencing based on the given short-read sequening. The development of Next-generation Sequencing (NGS) technologies has revolutionized the genome research field and given rise to the explosive increase of Bioinformatics and Medicine. NGS can get millions of short-read sequencing one time and finish the complete genome sequencing in short time, but costs lower than traditional Sanger shotgun method. Currently, the biggest challenge to genome research is the continuing explosive growing genome sequencing can not get timely calculation and analysis, and follow-up research works are unable to carry out. So, the research of high performance genome assembly algorithm will play a key role in promoting Bioinformatics.Typically, the assembly algorithms that assembling the short sequencing generated by the NGS are all based on three categories of methods:the Greedy-extension strategy, the Overlap-layout appraoch and the De Bruijn graph approach. The De Bruijn graph approach needs to look for an Eulerian Path, and ideally, the complexity of Eulerian Path problem is linear. Another attractive property of De Bruijn graph is that repeats in the genome can be collapsed and do not lead to many spurious overlaps.In this paper, we present a new method that uses dynamic hash algorithm to build De Bruijn graph. Our hash funtion and colision resolution method are well designed, and the load factor is careful consideration, so the complexity of insert and search operation is very low. Search operation is a frequent operation in genome assembly, the reduction of search time will play an important role in improving algorithm peformance. While the dynamic extension of the hash table make our algorithm applied to assembly different sizes of genome data. Moreover, genome data are always very large, and as the input of assembly algorithm, the size of genome files can be hundreds of GB or even TB. In this case, the file I/O will take long time in assembly procedure. In order to overcome the file I/O bottleneck, we have used a parallelized file I/O scheme. At the beginning of the algorithm, the genome file is first divided evenly into a set of smaller sized subfiles and all subfiles are distributed evenly to all processes, then a parallelized file I/O scheme is used and all processes will communicate with each other to exchange linkage information. In order to make full use of our distributed-memory systems, based on the full analysis of De Bruijn graph, we have used both the shared-memory multi-threading and distributed-memory parallelism to implement the contig generation and scaffolding stages.Evaluations using three paired-end datasets and the Yoruban individual dataset show that, our algorithm achieves speedups up to 7 compared to two other well parallelized assemblers:ABySS and PASHA while delivering comparable accuracy on 64 CPU cores.

Keywords/Search Tags:

Genome Assembly, Next-generation Sequencing, De Bruijn graph, Dynamic Hash, MPI

PDF Full Text Request

Related items

1	The Implementation Of Metagenome Sequencing Assembly Based On De Bruijn Graph Algorithm
2	Scaling short read de novo DNA sequence assembly to gigabase genomes
3	Sequence Assembly Algorithms For Next-generation Sequencing Technology Research
4	DNA Contig Generation Algorithm Based On De Bruijn Graph
5	Research And Implementation Of Sequence Assembly Parallel Programming On Bi-directed De Bruijn Graph
6	Research On Genomic Reads Mapping Based On De Bruijn Graph Model
7	Practical Strategies Of Bacterial Genome Assembly:Hybird Assembly
8	Genome Assembly Guided By Reads
9	Research And Application Of The Third Generation Genome Assembly High Repeat Sequence
10	A Parallel And Optimized Algorithms For De Novo Short Read Assembly Using De Bruijn Graphs