
Research On Parallel De Novo Assembly Based On De Bruijn Graph

Posted on: 2016-05-17 | Degree: Master | Type: Thesis
Country: China | Candidate: L J Zhang | Full Text: PDF
GTID: 2428330542457396 | Subject: Computer software and theory
Abstract/Summary:
With the successful completion of the Human Genome Project, genomics has entered the post-genomic era, which focuses on the analysis of gene structure and function. At the same time, genome sequencing technology is developing rapidly toward higher accuracy and lower cost. Obtaining genome sequences quickly, with high throughput and at low cost, remains a basic and important step in genomics. The sequence data (reads) produced by next-generation sequencing technology are characterized by large data volume, short read length and low accuracy, and existing sequence assembly techniques are not well suited to these characteristics. Further research on the assembly of next-generation sequencing data is therefore imperative. At present, sequence assembly algorithms based on the de Bruijn graph are the main method used in genome sequencing. This approach stores the sequences in a de Bruijn graph and offers low memory use, high accuracy and high coverage.

Building on next-generation sequencing technology, this thesis studies the problem of de novo genome sequencing in depth and obtains the following results.

First, the emergence, definition and development of bioinformatics are surveyed, along with the main techniques of genome sequencing and assembly. The principles of de Bruijn graph based sequence assembly algorithms and the corresponding software are studied in depth.

Second, to address the short reads, high throughput and large data volume of next-generation DNA sequencing data, we introduce the concept of a decision table and a subsequent k-mer selection method, and optimize the de Bruijn graph based sequence assembly algorithm.

Third, the de Bruijn graph sequence assembly algorithm based on the MapReduce model is studied in depth. Based on the proposed model, a specific method and a parallelization method that avoid partitioning the de Bruijn graph into blocks are proposed, the de Bruijn graph is constructed with varying k values to obtain the maximum assembly efficiency, and a parallel de novo assembly program based on the de Bruijn graph is implemented.

Finally, extensive experiments are carried out and the results are compared with those of existing algorithms.

The optimization techniques presented in this thesis for de Bruijn graph based sequence assembly improve the efficiency and accuracy of assembly to some extent. The MapReduce-based parallelization of the assembly algorithm improves the scalability of the de novo algorithm and greatly increases the speed of sequence assembly. De novo genome sequencing uses no reference sequence and derives the DNA sequence directly from the sequencing reads. For the genome of a newly sequenced species, it is the only available approach. The results of this study have theoretical and practical value for more accurate, faster and higher-throughput DNA sequencing.
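As a rough illustration of the general idea behind de Bruijn graph construction in a MapReduce style, the sketch below counts k-mers with a map step and a reduce step, then links each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix. It is a generic textbook-style sketch, not the thesis's algorithm: the function names, the `min_count` filter and the sample reads are all hypothetical, and the decision table and k-mer selection method mentioned in the abstract are not modeled.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def map_phase(reads, k):
    """Map step: emit a (k-mer, 1) pair for each k-mer occurrence."""
    for read in reads:
        for kmer in kmers(read, k):
            yield kmer, 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each distinct k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def build_de_bruijn(counts, min_count=1):
    """Build a de Bruijn graph as an adjacency map.

    Each k-mer seen at least min_count times (a hypothetical filter)
    becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    """
    graph = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads; real data would come from a sequencer.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
counts = reduce_phase(map_phase(reads, 4))
graph = build_de_bruijn(counts)
```

In a real MapReduce deployment the map and reduce steps would run on separate workers over partitions of the read set; here they run in a single process only to show the data flow. Rerunning with a different `k` changes the graph, which is the kind of parameter the abstract describes varying.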
Keywords/Search Tags:de novo genomics, new generation sequencing, de Bruijn graph, de novo assembly, MapReduce parallel computing