Reference-Based and GPU-Accelerated Compression Methods for FASTQ Files

Posted on: 2019-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Peng
Full Text: PDF
GTID: 2428330566961898
Subject: Computer technology
Abstract/Summary:
With the development of sequencing technology, DNA sequencing data is growing exponentially, and data storage and transmission have become urgent problems. Researchers have focused on two kinds of compression for DNA sequencing data: reference-based and reference-free. Reference-based compression tools usually sacrifice compression time for a better compression ratio. With the popularity of GPU devices and programming frameworks, combining data compression with high-performance computing has become an effective way to address this problem. In this dissertation, we propose two GPU-accelerated reference-based compression tools, GACcomp and gFQZip. The main contributions of this work are as follows:

1. A reference-based compression tool for FASTQ files, GACcomp, is proposed, which uses GPU-based arithmetic coding. GACcomp separates the three components of an input FASTQ file (metadata, bases, and quality scores) and then proceeds in three steps: 1) it simplifies the metadata with a template chains algorithm; 2) it aligns the bases against the reference with a sparse indexing algorithm; and 3) it compresses the intermediate files (including the simplified metadata and the mapping information) with GPU-based arithmetic coding. The quality scores are compressed by a block-sorting compressor. Experimental results indicate that (de)compression speed is increased and overall performance is improved.

2. A GPU-accelerated reference-based compression tool for FASTQ files, gFQZip, is proposed. Similar to GACcomp, we implement the sparse indexing algorithm on the GPU. All the intermediate files (including the simplified metadata, the mapping information, and the quality scores) are fed to a GPU-implemented compression module that mainly consists of the Burrows-Wheeler transform, the move-to-front transform, and range encoding. Experimental results indicate that gFQZip boosts the speed of reference-based sequencing data compression while maintaining a satisfactory compression ratio. gFQZip compresses much faster than the other reference-based methods, reaching up to a ~16.8-fold speedup.

In summary, this dissertation proposes novel GPU-based, reference-based compression methods for sequencing data. The new methods increase (de)compression speed with a reasonable compression ratio and memory consumption. They are expected to serve as candidate solutions for relieving the pressure brought by high-throughput DNA sequencing.
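The abstract describes separating a FASTQ file into metadata, base, and quality streams and passing intermediate data through a Burrows-Wheeler transform followed by a move-to-front transform. The sketch below illustrates those two ideas on the CPU in plain Python. It is not the thesis's GPU implementation: the function names (`split_fastq`, `bwt`, `mtf_encode`, and so on) are hypothetical, the naive rotation sort stands in for whatever parallel algorithm the tools actually use, and the final range-encoding stage, the template chains step, and the reference alignment are omitted.

```python
from typing import List, Tuple

SENTINEL = "\x00"  # assumed end marker; must not occur in the input


def split_fastq(text: str) -> Tuple[List[str], List[str], List[str]]:
    """Split 4-line FASTQ records into metadata, base, and quality streams."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    meta, bases, quals = [], [], []
    for i in range(0, len(lines), 4):
        meta.append(lines[i])        # '@' header line
        bases.append(lines[i + 1])   # nucleotide sequence
        quals.append(lines[i + 3])   # quality string; lines[i + 2] is '+'
    return meta, bases, quals


def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if SENTINEL in s:
        raise ValueError("input must not contain the sentinel character")
    s += SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(last: str) -> str:
    """Invert the BWT by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    original = next(row for row in table if row.endswith(SENTINEL))
    return original[:-1]


def mtf_encode(s: str) -> List[int]:
    """Move-to-front over a fixed 256-symbol alphabet (code points 0..255)."""
    table = [chr(i) for i in range(256)]
    out = []
    for c in s:
        idx = table.index(c)
        out.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return out


def mtf_decode(codes: List[int]) -> str:
    """Inverse of mtf_encode."""
    table = [chr(i) for i in range(256)]
    out = []
    for idx in codes:
        c = table.pop(idx)
        out.append(c)
        table.insert(0, c)
    return "".join(out)


if __name__ == "__main__":
    fq = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"
    meta, bases, quals = split_fastq(fq)
    stream = "\n".join(quals)
    codes = mtf_encode(bwt(stream))
    # Round trip: undo MTF, then undo BWT.
    assert inverse_bwt(mtf_decode(codes)) == stream
    print(codes)
```

The BWT groups identical symbols into runs, and move-to-front turns those runs into small integers, mostly zeros, which is what makes a subsequent entropy coder such as a range encoder effective.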
Keywords/Search Tags:DNA compressor, Reference-based compression, Graphics Processing Units, GPU, GACcomp tool, gFQZip tool