Parallelized Fast Compression Method Of High-throughput DNA Sequencing Data

Posted on:2019-10-08

Degree:Master

Type:Thesis

Country:China

Candidate:Q J Deng

GTID:2428330566461897

Subject:Computer technology

Abstract/Summary:

The advent of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing. Although disk prices have declined quickly in recent years, the growing volume of raw data remains a hard problem. Storing DNA sequencing data with effective compression methods reduces both the storage space required and the transmission bandwidth occupied.

The first part of this dissertation introduces the background and current status of research on DNA sequencing data, including the development of sequencing technology, the storage formats of DNA sequencing data, the development of DNA sequencing data compression techniques, and existing related work. Two new compression methods, LW-FQZip 2 and KMCompress, are then proposed, and their performance is demonstrated by comparison with other state-of-the-art DNA sequencing data compression tools.

LW-FQZip 2 is a reference-genome-based lossless compression method that improves on LW-FQZip 1. It uses a parallel light-weight mapping model to match high-throughput sequencing short reads to a given reference genome. It then compresses the mapping results and the remaining data with a prediction by partial matching (PPM) model and an arithmetic coder, achieving more effective coding and parallel computing performance. Experimental studies are conducted on both short-read and long-read data generated by various sequencing platforms. The results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs.

KMCompress is a reference-free lossless compression method. First, it rapidly reconstructs the input data and groups similar short reads or long reads together. Second, it uses a finite-context prediction model to estimate symbol probabilities and an arithmetic coder to encode the sequencing data, which effectively reduces the information entropy that needs to be recorded. To some extent, KMCompress overcomes the disadvantages of reference-based compression, because it does not rely on an external reference genome while achieving better overall performance.

In summary, this study proposes new compression methods for FASTQ files. The new methods achieve a good balance between compression ratio and speed, can help relieve the storage and transmission pressure of high-throughput DNA sequencing data, and can serve as a reference for future research.
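The idea behind pairing a finite-context prediction model with an arithmetic coder can be illustrated with a small sketch. This is not the KMCompress implementation. It is a minimal adaptive order-k model, under the assumptions that the input contains only the bases A, C, G and T and that counts are smoothed with an add-one prior. The function name `estimated_bits` and the parameter `order` are hypothetical. The function returns the ideal code length, the sum of -log2 p over all symbols, which an arithmetic coder can closely approach.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: no N or other IUPAC symbols in the input


def estimated_bits(sequence, order=3):
    """Ideal code length in bits of `sequence` under an adaptive
    order-`order` finite-context model with add-one smoothing."""
    # One count vector per context; every symbol starts with count 1,
    # so an unseen context predicts each base with probability 1/4.
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    total_bits = 0.0
    for i, base in enumerate(sequence):
        # The context is the (up to) `order` bases preceding position i.
        context = sequence[max(0, i - order):i]
        symbol_counts = counts[context]
        idx = ALPHABET.index(base)  # raises ValueError on non-ACGT input
        p = symbol_counts[idx] / sum(symbol_counts)
        total_bits += -math.log2(p)  # cost of coding this base
        symbol_counts[idx] += 1      # adapt the model after coding
    return total_bits


if __name__ == "__main__":
    repetitive = "ACGT" * 50
    print(len(repetitive) * 2, "bits with a fixed 2-bit code")
    print(round(estimated_bits(repetitive), 1), "bits under the context model")
```

On a highly repetitive input the model quickly learns which base follows each context, so the estimated length falls well below the 2 bits per base of a fixed code. The same effect is why grouping similar reads together before modelling can reduce the amount of information that has to be encoded.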

Keywords/Search Tags:

DNA sequencing techniques, Reference-based compression, Reference-free compression, FASTQ

Related items

1	Based On Reference And GPU-accelerated Compression Method Of FASTQ Files
2	Research On FASTQ Gene Data Compression And Parallelization Based On Domestic Big Data Appliance
3	Design And Hardware Implementation Of Lossy Reference Frame Compression Algorithm For Low Power H.264 Coding
4	Research On Fast And High Efficient Algorithm For HEVC Lossless Compression
5	Research On Algorithm Of Detecting Non-aligned Double JPEG Compression
6	Relational Database Compression Based On Tuples-Clustering
7	Lossless Compression Of Hyperspectral Images Based On Prediction
8	Research On Pre-silicon Reference-free Hardware Trojan Detection Techniques
9	No-Reference Objective Quality Assessment For Networked Audio
10	The application of model reference adaptive control for vapour compression systems