Research On DNA Assembly Algorithm Based On De Bruijn Graph

Posted on:2015-01-31

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Wang

Full Text:PDF

GTID:2298330452950107

Subject:Communication and Information System

Abstract/Summary:

PDF Full Text Request

Whole genome DNA sequence assembly is one of the most important research topicsin the field of modern bioinformatics.With the widely used of the sequencing inindustry and the use o second generation of gene sequencing machine characterizedby high throughput, short sequences, high coverage,sequencing errors introduced, alarge number of gene sequencing data often for billions of base pairs need to beassembled, which gives existing gene-splicing technology a huge challenge.Currently,gene sequencing technology is mainly based on graph theory algorithms, and One ofthe most widely used algorithms is based on De Bruijn graph. For massive genesequencing data, building distributed De Bruijn graph and achievin parallelization ongene splicing,can effectively solve the problem of insufficient memory resourcescaused by calculating the amount of data, reasonably improve the assembly speed,shorten sequencing period of time, reduce the cost of sequencing cost,optimizequality of assembly,as in response to the requirements of large-scale applications.Therefore, based on De Bruijn graph, research on parallelization of large-scale genessequences data is of great theoretical and practical significance, and it is the hotspotof the study.In this paper, based on the biological gene sequence data as the research object,we have put forward a rapid, efficient and scalable parallel genetic sequence splicingalgorithm based on the De Bruijn graph, which adopt the method of graph theory andthe technology of parallel computing.The author’s main research works are as follows:(1) Introduced some basic concepts and related algorithms about the basiccalculation model of the gene sequencing technology (OLC graph model and the DeBruijn graph model), which focus on comparing JA algorithm that has the referencevalue.(2) Proposed an high efficient of speed and scalable sequence splicing approachbased on De Bruijn graph. In the process of constructing figure, This paper haveimproved the storage mode of bidirected De Bruijn patterning in YAGA algorithm,using less storage space.In the simplification of graph, this thesis remove some tips， which are less than2k, referring to the processing principle in velvet. And this papermake use of the depth-first-search algorithm (DFS) to reduce a large number ofcommunication and the movement of the computing nodes among the processors, aswell as to decrease the consumption of time and space in the process of splicing. Interms of data pretreatment, according to the principles of frequency, we remove nodesbelow the threshold, reduced the number of vertices in the graph and the wrongassemble due to sequencing errors, which optimize our results.(3) In performance testing phase of the algorithm, this paper employed fourgroupts of different size of genetic data to test and analyse the performance of thealgorithm from execution time, speed ratio and scalability. The experimental resultsshow that the algorithm proposed had highter accelerating ratio, and good scalability,especially for large-scale genome.

Keywords/Search Tags:

Sequencing technique, Gene assembly, MPI parallel technology, DeBruijn graph

PDF Full Text Request

Related items

1	Research On Parallel De Novo Assembly Based On De Bruijn Graph
2	Sequencing Optimization Of Mixed Model Assembly Lines Based On Shuffled Frog Leaping Algorithm
3	Research On Assembly Method Based On Metagenomic Sequencing Data
4	Research On Genome Assembly Method Based On High Throughput Sequencing Data
5	Research On Biological High-Throughput Sequencing Fragment Assembly And Molecular Biomarker Detection Algorithms
6	The Method And Parallel Processing For Fragment Assembly In Whole Genome DNA Sequecing Project
7	Research On Balancing And Sequencing Problems Of Stochastic Assembly Line
8	Research On Mixed-model Assembly Line Sequencing Problem Based On Orders
9	Research On Automatic Assembly Of Parallel Mechanism Base On Reverse CAD Assembly Technology
10	DNA Sequence Assembly On Distributed-memory Environment