Design And Implementation Of A Parallel DNA Sequences Mapping System Based On MPI

Posted on:2015-11-23

Degree:Master

Type:Thesis

Country:China

Candidate:H Li

Full Text:PDF

GTID:2180330422492275

Subject:Computer technology

Abstract/Summary:

With the advent of sequencing technology, massive numbers of short DNA sequences are being produced. Although these high-throughput sequences greatly advance the life sciences, they also pose a new challenge for short sequence mapping tools. Mapping tools such as BWA, Bowtie and mrsFAST have been developed in recent years, but they still have difficulty meeting requirements for accuracy, mapping time and memory cost. This thesis studies DNA sequence mapping further and proposes two strategies: an MPI-based parallel mapping method and a sorting-based short read mapping algorithm, which together effectively handle the mapping of more than 100 GB of short DNA sequences.

The MPI-based parallel mapping method consists of three steps. First, the master node transmits sequences to the corresponding nodes and creates a hash index. Second, every node calls the exact mapping algorithm and transmits its mapping results to the master node. Finally, every node redistributes its unmapped sequences and calls the inexact mapping algorithm. The method can run with an arbitrary number of nodes and threads, and it transmits data asynchronously while the mapping algorithm is running, which reduces the time cost of parallel transmission.

The sorting-based DNA sequence mapping algorithm consists of a sectional sorting algorithm, an exact mapping algorithm and an inexact mapping algorithm. The exact mapping algorithm traverses the sorted sequences quickly and obtains the mapping results. The inexact mapping algorithm processes most sections of a sequence in the same way as exact mapping and searches for base errors only in the remaining sections. By reusing the sectional sorting results, the method reduces the number of inexact mapping operations.

To test the practical effect of the mapping algorithm and the parallel method, several experiments were carried out under the Linux operating system in an MPI parallel environment. The results show that the proposed mapping algorithm is more efficient than traditional approaches when the number of errors is limited. The proposed parallel method makes effective use of computing resources and greatly improves mapping speed. The MPI-based parallel mapping system not only handles large amounts of data quickly but also has a low memory cost, which indicates good performance and broad applicability.
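
The two-phase flow described in the abstract (exact mapping first, then inexact mapping of the reads that remain unmapped) can be illustrated with a small sketch. This is not the thesis's implementation. It assumes a plain seed-and-verify scheme: a read is cut into fixed-length sections, each section is looked up exactly in a hash index of the reference, and candidate positions are verified by Hamming distance. MPI is replaced by an ordinary loop over simulated node ranks, and all function names and parameters (`build_hash_index`, `map_read`, `two_phase_map`, `seed_len`, `max_errors`) are hypothetical.

```python
from collections import defaultdict


def build_hash_index(reference, seed_len):
    """Map every reference substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, seed_len, max_errors):
    """Return sorted (position, mismatches) pairs with mismatches <= max_errors.

    Assumption: substitutions only. With at least max_errors + 1 sections,
    one section of any valid alignment must match the reference exactly.
    """
    n_sections = len(read) // seed_len
    if n_sections < max_errors + 1:
        raise ValueError("read too short for this seed_len and max_errors")
    hits = {}
    for s in range(n_sections):
        offset = s * seed_len
        section = read[offset:offset + seed_len]
        # Exact lookup of one section; .get avoids inserting new keys.
        for seed_pos in index.get(section, ()):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            # Verify the whole read at the candidate position.
            distance = hamming(read, reference[start:start + len(read)])
            if distance <= max_errors:
                hits[start] = distance
    return sorted(hits.items())


def two_phase_map(reads, reference, n_nodes, seed_len, max_errors):
    """Simulate the node workflow sequentially (no MPI involved)."""
    index = build_hash_index(reference, seed_len)
    results, unmapped = {}, []
    # Phase 1: each simulated node runs exact mapping on its share of reads.
    for rank in range(n_nodes):
        for read in reads[rank::n_nodes]:
            hits = map_read(read, reference, index, seed_len, 0)
            if hits:
                results[read] = hits
            else:
                unmapped.append(read)
    # Phase 2: unmapped reads are redistributed and mapped inexactly.
    for rank in range(n_nodes):
        for read in unmapped[rank::n_nodes]:
            results[read] = map_read(read, reference, index, seed_len, max_errors)
    return results
```

Calling `two_phase_map(reads, reference, n_nodes=2, seed_len=4, max_errors=1)` returns a dictionary from each read to its list of `(position, mismatches)` hits, with an empty list for reads that could not be mapped. In a real MPI program the round-robin slices would become messages sent between ranks. The sketch also omits the sectional sorting step that the thesis uses to reduce repeated work.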

Keywords/Search Tags:

second-generation sequencing, short read mapping, hash index, parallel programming with MPI

Related items

1	Design And Optimization Of High-Performance Algorithms For Processing Biological Sequence Data
2	The Study On Read Alignment Algorithm For High-throughput Sequencing Datasets
3	Researches Of Short Sequence Alignment And Scaffold Algorithm Based On Next Generation Sequencing
4	Researches On Long Read Alignment Algorithms Oriented To The Third Generation Sequencing Technology
5	Optimizing High-throughput Biological Gene Sequencing Data Processing Algorithms Based On Hash
6	An Alignment Algorithm For DNA Short Reads Based On The Hamming Distance
7	The Mapping And Analysis Tool Of The Third Generation RNA-seq File
8	Algorithms and Applications in Genome Assembly using Long Read Sequencing Technology
9	A Design Of Short Gene Sequence Alignment Acceleration System Based On High Performance Hash Table
10	Statistical Analysis Method For Quick Mapping Of QTLs Using Next-Generation Sequencing Technology