Genomic Sequence Analysis On Heterogeneous CPU/GPU Platforms

Posted on:2019-03-05

Degree:Master

Type:Thesis

Country:China

Candidate:N Bai

GTID:2370330626452393

Subject:Computer Science and Technology

Abstract/Summary:

Genetic sequences are of great importance in bioinformatics and other fields. For example, biological evolution can be inferred through multiple sequence alignment (MSA). In general, MSA compares one or more new genome sequences obtained by sequencing technology with other genomes already stored in a database. However, current sequencing technology can only generate a large number of short DNA reads, which must be assembled to reconstruct the original whole genomes before MSA can be applied to them. The rapid growth of biological datasets has brought many challenges, and the powerful computing resources of GPUs have been applied to many computationally intensive tasks. In this thesis, we study genome assembly and the optimization of MSA across multiple users on heterogeneous CPU/GPU platforms.

Before newly generated short reads can be used for MSA, a genome assembly algorithm has to reconstruct the whole genome from them. The de Bruijn graph has been widely used in de novo genome assembly of short reads. Unfortunately, the large volume of intermediate data makes it challenging to construct the whole de Bruijn graph, and the intensive computation makes the assembly time unacceptably long. We propose a genome assembly system named GMSP that focuses on graph construction. First, we use the Minimum Substring Partitioning (MSP) algorithm to partition the genetic data. Second, we use a Bloom filter to filter out invalid vertices and redesign the hash table that stores vertices and edges. Third, to reduce I/O consumption, we encode the intermediate genetic data. Fourth, we pipeline data transfer and computation to improve overall time performance. Experimental results show that GMSP achieves a speedup of up to 25 times.

Once new whole genome sequences have been reconstructed, an MSA algorithm can compare them with other genomes stored in databases. In recent years, many bioinformatics institutes have set up shared servers to which multiple users submit MSA jobs. Because different MSA jobs often process similar datasets, users can share their computation results and thereby avoid redundant computation. We propose an efficient MSA system called SMSA for multiple users on shared heterogeneous CPU/GPU platforms. SMSA shares data between jobs to accelerate the overall computation. We also propose a scheduling strategy based on the similarity between the datasets of MSA jobs. Furthermore, a co-run computation model is adopted to make full use of the computing resources. Experimental results show that SMSA achieves a speedup of up to 32 times.
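
To make the partitioning and encoding steps mentioned in the abstract more concrete, the following is a minimal CPU-only Python sketch, not the GMSP implementation. It assumes the common formulation of Minimum Substring Partitioning, in which consecutive k-mers that share the same lexicographically smallest length-p substring are merged into one "super k-mer" and sent to the same partition. The abstract does not say how intermediate data are encoded, so the 2-bit-per-base packing shown here is an assumption. The names `minimizer`, `msp_partition`, `encode_2bit`, and `decode_2bit` are hypothetical, and reverse complements are ignored for brevity.

```python
import zlib
from collections import defaultdict

BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def minimizer(kmer: str, p: int) -> str:
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def msp_partition(read: str, k: int, p: int, num_partitions: int):
    """Split a read into super k-mers and group them by partition id.

    Consecutive k-mers with the same minimizer are merged into one
    substring, so each k-mer of the read lands in exactly one partition.
    """
    partitions = defaultdict(list)
    if len(read) < k:
        return partitions

    def bucket(m: str) -> int:
        # Assumed hash choice: CRC32 of the minimizer, modulo partition count.
        return zlib.crc32(m.encode()) % num_partitions

    start = 0
    current = minimizer(read[:k], p)
    for i in range(1, len(read) - k + 1):
        m = minimizer(read[i:i + k], p)
        if m != current:
            # k-mers starting at start..i-1 share `current`.
            partitions[bucket(current)].append(read[start:i - 1 + k])
            start, current = i, m
    partitions[bucket(current)].append(read[start:])
    return partitions


def encode_2bit(seq: str):
    """Pack an A/C/G/T string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value, len(seq)


def decode_2bit(value: int, length: int) -> str:
    """Inverse of encode_2bit; length is needed to restore leading 'A's."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    parts = msp_partition("ACGTTGCATGCA", k=5, p=2, num_partitions=4)
    for pid, supers in sorted(parts.items()):
        print(pid, [encode_2bit(s) for s in supers])
```

Because every k-mer in a partition shares a minimizer, each partition can be processed independently, which is what makes this scheme attractive for building a large de Bruijn graph in pieces. Packing bases into 2 bits cuts the stored size to roughly a quarter of one byte per base, under the assumption that reads contain only A, C, G, and T.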

Keywords/Search Tags:

Genome assembly, MSA, Data sharing, MSP, Co-run computation, Hash table, CUDA, Pipeline

Related items

1	A Design Of Short Gene Sequence Alignment Acceleration System Based On High Performance Hash Table
2	Genome Assembly Algorithm Based On Next-Generation Sequencing
3	Parallel Optimization For Whole Genome Re-sequencing Sequences Analysis Pipeline
4	Fast Convexhull Computation Parallel Design And Implementation Based On CUDA
5	Whole Microbial Genome Assembly And Analysis Based On Ion Torrent Sequencing Data
6	Construction And Application Of The City's Comprehensive Pipeline Database
7	Development And Application Of DNA Methylation Data Analysis Software
8	The Establishment Of Conventional Genome Analysis Pipeline For Plants With Complex Polyploid Genome
9	Efficient Distributed Large-scale Genome Sequence Assembly
10	Parallel and Cloud Computing Based Genome Assembly using Bi-directed String Graphs