Genomic Sequence Analysis On Heterogeneous CPU/GPU Platforms

Posted on: 2019-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: N Bai
Full Text: PDF
GTID: 2370330626452393
Subject: Computer Science and Technology
Abstract/Summary:
Genetic sequences are of great importance in bioinformatics and related fields. For example, multiple sequence alignment (MSA) can be used to infer biological evolution. In general, MSA compares one or more new genome sequences obtained by sequencing technology with other genomes already stored in a database. However, current sequencing technology can only generate large numbers of short DNA reads, which must be assembled to reconstruct the original whole genomes before those genomes can be passed to an MSA algorithm. The rapid growth of biological datasets poses many challenges, and the computing power of GPUs has been applied to many computationally intensive tasks. In this thesis, we study genome assembly and the optimization of MSA across multiple users on heterogeneous CPU/GPU platforms.

Before newly generated short reads can be used for MSA, a genome assembly algorithm must assemble them to reconstruct the whole genome. The de Bruijn graph is widely used in de novo genome assembly of short reads. Unfortunately, the large volume of intermediate data makes it challenging to construct the whole de Bruijn graph, and the intensive computation makes assembly time unacceptably long. We put forward a genome assembly system named GMSP that focuses on graph construction. First, we use the Minimum Substring Partitioning (MSP) algorithm to partition the genetic data. Second, we use a Bloom filter to filter out invalid vertices and redesign the hash table that stores vertices and edges. Third, to reduce I/O consumption, we encode the intermediate genetic data. Fourth, we pipeline data transfer and computation to improve overall running time. Experimental results showed that GMSP achieved a speedup of up to 25 times.

Once new whole-genome sequences have been reconstructed, an MSA algorithm can compare them with other genomes stored in databases. In recent years, many bioinformatics institutes have set up shared servers to which multiple users submit MSA jobs. Because different MSA jobs often process similar datasets, users have an opportunity to share computation results and thus avoid redundant computation. We propose an efficient MSA system called SMSA for multiple users on shared heterogeneous CPU/GPU platforms. SMSA uses data sharing to accelerate the overall computation. We also propose a scheduling strategy based on the similarity between the datasets of MSA jobs. Furthermore, a co-run computation model is adopted to make full use of the computing resources. Experimental results showed that SMSA can achieve a speedup of up to 32 times.
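
The abstract names Minimum Substring Partitioning as the first step of graph construction but does not describe it. The following is a minimal CPU-only sketch of the general technique, not GMSP's GPU implementation: each k-mer of a read is assigned to a partition keyed by its lexicographically smallest length-p substring, and consecutive k-mers that share that key are merged into one longer substring (a "super k-mer"). The function names `minimizer` and `msp_partition`, and the values of `k` and `p` in the usage note, are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def minimizer(kmer: str, p: int) -> str:
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def msp_partition(read: str, k: int, p: int) -> Dict[str, List[str]]:
    """Group the k-mers of a read into super k-mers keyed by minimizer.

    Consecutive k-mers with the same minimizer are stored as a single
    substring, so a partition holds fewer characters than a plain
    list of its k-mers would.
    """
    partitions: Dict[str, List[str]] = defaultdict(list)
    if len(read) < k:
        return partitions  # no complete k-mer in this read

    start = 0                          # index of first k-mer in current run
    current = minimizer(read[:k], p)   # minimizer shared by the current run
    for i in range(1, len(read) - k + 1):
        m = minimizer(read[i:i + k], p)
        if m != current:
            # k-mers start..i-1 share `current`; together they span
            # read[start : (i - 1) + k].
            partitions[current].append(read[start:i - 1 + k])
            start, current = i, m
    # Flush the final run, which extends to the end of the read.
    partitions[current].append(read[start:])
    return partitions
```

Calling `msp_partition("ACGTTGCATGCA", k=5, p=3)` returns a dictionary whose keys are 3-character minimizers and whose values are the super k-mers assigned to each. Because every k-mer lands in exactly one partition, partitions can be processed independently, which is what makes the scheme attractive for building a de Bruijn graph piece by piece.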
Keywords/Search Tags: Genome assembly, MSA, Data sharing, MSP, Co-run computation, Hash table, CUDA, Pipeline