
Parallelized Fast Compression Method Of High-throughput DNA Sequencing Data

Posted on: 2019-10-08 · Degree: Master · Type: Thesis
Country: China · Candidate: Q J Deng · Full Text: PDF
GTID: 2428330566461897 · Subject: Computer technology
Abstract/Summary:
The advent of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing. Although disk prices have fallen quickly in recent years, the growing volume of raw sequencing data remains a hard problem. Storing DNA sequencing data with effective compression methods reduces both the storage space required and the transmission bandwidth consumed.

The first part of this dissertation introduces the background and current status of research on DNA sequencing data, including the development of sequencing technology, the storage formats of DNA sequencing data, the development of DNA sequencing data compression techniques, and existing related work. Two new compression methods, LW-FQZip 2 and KMCompress, are then proposed, and their performance is demonstrated through comparison with other state-of-the-art DNA sequencing data compression tools.

LW-FQZip 2 is a reference-genome-based lossless compression method that improves on LW-FQZip 1. It uses a parallel light-weight mapping model to match high-throughput sequencing short reads to a given reference genome. It then compresses the mapping results and other data with a prediction by partial matching (PPM) model and an arithmetic coder, achieving more effective coding and better parallel computing performance. Experiments are conducted on both short-read and long-read data generated by various sequencing platforms. The results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs.

KMCompress is a reference-free lossless compression method. First, it rapidly restructures the input data so that similar short or long reads are grouped together. Second, it uses a finite-context prediction model to estimate symbol probabilities and an arithmetic coder to encode the sequencing data, which effectively reduces the amount of information that must be recorded. To some extent, KMCompress overcomes the disadvantages of reference-based compression: it does not rely on an external reference genome, yet it achieves better overall performance.

In summary, this study proposes new compression methods for FASTQ files. The new methods achieve a good balance between compression ratio and speed, and they can help relieve the storage and transmission pressure of high-throughput DNA sequencing data. The study may also serve as a reference for future research.
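
The finite-context prediction step mentioned for KMCompress can be illustrated with a small sketch. The code below is not the thesis implementation. It is a generic adaptive order-k context model over the alphabet A, C, G, T, and it reports the ideal code length in bits that an arithmetic coder driven by these probabilities would approach. The function name `estimated_bits`, the `order` parameter, and the add-one initial counts are assumptions made for illustration, and real FASTQ reads may also contain symbols such as N, which this sketch does not handle.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def estimated_bits(seq, order=2):
    """Ideal adaptive code length (in bits) of seq under an order-k context model.

    Hypothetical helper for illustration; not taken from KMCompress.
    """
    # One count table per context, initialised to 1 per symbol (add-one smoothing)
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    total_bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]   # up to `order` preceding bases
        table = counts[context]
        sym = ALPHABET.index(base)           # raises ValueError on other symbols
        p = table[sym] / sum(table)          # predicted probability of this base
        total_bits += -math.log2(p)          # cost an ideal arithmetic coder would pay
        table[sym] += 1                      # update the model after coding
    return total_bits


if __name__ == "__main__":
    read = "ACGT" * 50
    bits = estimated_bits(read, order=2)
    print(f"{bits:.1f} bits versus {2 * len(read)} bits at a fixed 2 bits per base")
```

The more predictable the next base is from its context, the fewer bits it costs, which is why grouping similar reads before modelling can lower the total code length.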
Keywords/Search Tags: DNA sequencing techniques, Reference-based compression, Reference-free compression, FASTQ