Font Size: a A A

Research On DNA Assembly Algorithm Based On De Bruijn Graph

Posted on:2015-01-31Degree:MasterType:Thesis
Country:ChinaCandidate:X Y WangFull Text:PDF
GTID:2298330452950107Subject:Communication and Information System
Abstract/Summary:PDF Full Text Request
Whole genome DNA sequence assembly is one of the most important research topicsin the field of modern bioinformatics.With the widely used of the sequencing inindustry and the use o second generation of gene sequencing machine characterizedby high throughput, short sequences, high coverage,sequencing errors introduced, alarge number of gene sequencing data often for billions of base pairs need to beassembled, which gives existing gene-splicing technology a huge challenge.Currently,gene sequencing technology is mainly based on graph theory algorithms, and One ofthe most widely used algorithms is based on De Bruijn graph. For massive genesequencing data, building distributed De Bruijn graph and achievin parallelization ongene splicing,can effectively solve the problem of insufficient memory resourcescaused by calculating the amount of data, reasonably improve the assembly speed,shorten sequencing period of time, reduce the cost of sequencing cost,optimizequality of assembly,as in response to the requirements of large-scale applications.Therefore, based on De Bruijn graph, research on parallelization of large-scale genessequences data is of great theoretical and practical significance, and it is the hotspotof the study.In this paper, based on the biological gene sequence data as the research object,we have put forward a rapid, efficient and scalable parallel genetic sequence splicingalgorithm based on the De Bruijn graph, which adopt the method of graph theory andthe technology of parallel computing.The author’s main research works are as follows:(1) Introduced some basic concepts and related algorithms about the basiccalculation model of the gene sequencing technology (OLC graph model and the DeBruijn graph model), which focus on comparing JA algorithm that has the referencevalue.(2) Proposed an high efficient of speed and scalable sequence splicing approachbased on De Bruijn graph. In the process of constructing figure, This paper haveimproved the storage mode of bidirected De Bruijn patterning in YAGA algorithm,using less storage space.In the simplification of graph, this thesis remove some tips, which are less than2k, referring to the processing principle in velvet. And this papermake use of the depth-first-search algorithm (DFS) to reduce a large number ofcommunication and the movement of the computing nodes among the processors, aswell as to decrease the consumption of time and space in the process of splicing. Interms of data pretreatment, according to the principles of frequency, we remove nodesbelow the threshold, reduced the number of vertices in the graph and the wrongassemble due to sequencing errors, which optimize our results.(3) In performance testing phase of the algorithm, this paper employed fourgroupts of different size of genetic data to test and analyse the performance of thealgorithm from execution time, speed ratio and scalability. The experimental resultsshow that the algorithm proposed had highter accelerating ratio, and good scalability,especially for large-scale genome.
Keywords/Search Tags:Sequencing technique, Gene assembly, MPI parallel technology, DeBruijn graph
PDF Full Text Request
Related items