Font Size: a A A

Genome Assembly Algorithm Based On Next-Generation Sequencing

Posted on:2016-12-15Degree:MasterType:Thesis
Country:ChinaCandidate:K ZhaoFull Text:PDF
GTID:2180330461486334Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
De novo assembly is a basic subject of Bioinformatics. It’s a difficult but very significative research topic. De novo assembly is such technology that using computer technology to rebuild DNA sequencing based on the given short-read sequening. The development of Next-generation Sequencing (NGS) technologies has revolutionized the genome research field and given rise to the explosive increase of Bioinformatics and Medicine. NGS can get millions of short-read sequencing one time and finish the complete genome sequencing in short time, but costs lower than traditional Sanger shotgun method. Currently, the biggest challenge to genome research is the continuing explosive growing genome sequencing can not get timely calculation and analysis, and follow-up research works are unable to carry out. So, the research of high performance genome assembly algorithm will play a key role in promoting Bioinformatics.Typically, the assembly algorithms that assembling the short sequencing generated by the NGS are all based on three categories of methods:the Greedy-extension strategy, the Overlap-layout appraoch and the De Bruijn graph approach. The De Bruijn graph approach needs to look for an Eulerian Path, and ideally, the complexity of Eulerian Path problem is linear. Another attractive property of De Bruijn graph is that repeats in the genome can be collapsed and do not lead to many spurious overlaps.In this paper, we present a new method that uses dynamic hash algorithm to build De Bruijn graph. Our hash funtion and colision resolution method are well designed, and the load factor is careful consideration, so the complexity of insert and search operation is very low. Search operation is a frequent operation in genome assembly, the reduction of search time will play an important role in improving algorithm peformance. While the dynamic extension of the hash table make our algorithm applied to assembly different sizes of genome data. Moreover, genome data are always very large, and as the input of assembly algorithm, the size of genome files can be hundreds of GB or even TB. In this case, the file I/O will take long time in assembly procedure. In order to overcome the file I/O bottleneck, we have used a parallelized file I/O scheme. At the beginning of the algorithm, the genome file is first divided evenly into a set of smaller sized subfiles and all subfiles are distributed evenly to all processes, then a parallelized file I/O scheme is used and all processes will communicate with each other to exchange linkage information. In order to make full use of our distributed-memory systems, based on the full analysis of De Bruijn graph, we have used both the shared-memory multi-threading and distributed-memory parallelism to implement the contig generation and scaffolding stages.Evaluations using three paired-end datasets and the Yoruban individual dataset show that, our algorithm achieves speedups up to 7 compared to two other well parallelized assemblers:ABySS and PASHA while delivering comparable accuracy on 64 CPU cores.
Keywords/Search Tags:Genome Assembly, Next-generation Sequencing, De Bruijn graph, Dynamic Hash, MPI
PDF Full Text Request
Related items