Research On The Third-generation DNA Sequencing Data Compression Method

Posted on: 2021-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: H X Cui
Full Text: PDF
GTID: 2480306200950779
Subject: Computer technology
Abstract/Summary:
Since its advent, third-generation sequencing technology has played an increasingly important role in clinical molecular diagnosis, especially in genome sequencing, methylation research, and mutation identification (SNP detection). With the continuous development of sequencing technology, the cost of sequencing has fallen rapidly and the volume of sequencing data has grown sharply, so storing and transmitting these large datasets has become an urgent problem. Data compression can effectively reduce both the storage space and the transmission time of sequencing data. General-purpose compression tools cannot exploit the characteristics of DNA sequencing data, so their compression ratios on such data are unsatisfactory. Most existing specialized compression tools for DNA sequencing data were developed for second-generation sequencing data. Faced with the long reads, variable read lengths, and high error rates of third-generation sequencing data, most of these tools cannot work correctly. Hence, compression tools designed specifically for third-generation DNA sequencing data are in high demand.

This dissertation introduces the research background and state of the art of DNA sequencing data compression and proposes two compression methods for third-generation DNA sequencing data. The main contributions are summarized as follows:

(1) A compression algorithm for third-generation DNA sequencing base data, minBaseZip, is proposed, based on MinHash (minimum hashing) and locality-sensitive hashing (LSH). The algorithm uses the Jaccard coefficient to evaluate the similarity between sequences. A feature matrix is built over all base sequences, and similar sequences are quickly screened and grouped using MinHash and LSH. Finally, the sequences within each group are compressed using gzip tools based on context characteristics. Experiments are conducted on publicly available datasets collected from multiple sequencing platforms. Comparisons with multiple sequencing-data-specific compression tools and general-purpose compression tools show that minBaseZip achieves better compression.

(2) Building on the base data clustering above, an assembly-based method for compressing complete FASTQ data, minCompress, is further proposed. The algorithm divides the FASTQ file into three parts for compression. For the base part, the wtdbg2 assembly tool assembles each cluster file to obtain a reference genome, and each cluster file is then compressed against that reference with the reference-based compression method LWFQzip2; the metadata and the quality scores are compressed by delta encoding and run-length encoding, respectively. The method is tested on long-read FASTQ data from multiple sequencing platforms and compared with a variety of compression tools. The experimental results show that minCompress obtains a better compression ratio at reasonable time and space cost.

This study focuses on the compression of third-generation DNA sequencing data. It is expected to alleviate the storage and transmission pressure caused by such data and to provide some inspiration for subsequent related research.
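The screening step in contribution (1) can be illustrated with a minimal sketch. It is not the thesis implementation: the k-mer length, the number of hash functions, the band count, the BLAKE2-based hashing, and every function name below are assumptions chosen for illustration. Each read is reduced to a set of k-mers, a MinHash signature approximates the Jaccard coefficient between reads, and LSH banding proposes candidate pairs without comparing every read against every other.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=8):
    """Return the set of overlapping k-mers (shingles) of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer, salted with the hash index."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(kmer_set, num_hashes=64):
    """Keep the minimum of each salted hash; assumes a non-empty k-mer set."""
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates the Jaccard coefficient."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Split each signature into bands; reads sharing any band become candidates."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy reads; real long reads are far longer and noisier.
    reads = ["ACGTTGCAAGGCTTACGATCGGATCCTAGA",
             "ACGTTGCAAGGCTTACGATCGGATCCTAGA",
             "T" * 30]
    sigs = [minhash_signature(kmers(r)) for r in reads]
    print(sorted(lsh_candidate_pairs(sigs)))
```

In a pipeline of this kind, reads linked by candidate pairs would be grouped and each group handed to the compressor, so that similar sequences sit close together in the input. Using more bands with fewer rows per band makes the screen more permissive, which may matter for reads with high error rates.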
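The two lightweight encodings named in contribution (2) can be sketched as follows. The abstract does not say how minCompress applies them, so this example assumes, purely for illustration, that delta encoding is applied to a numeric field taken from the read identifiers and that run-length encoding is applied to the quality string of each read; the function names are hypothetical.

```python
from itertools import groupby


def delta_encode(values):
    """Store the first value, then the difference from each previous value."""
    deltas, prev = [], 0
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def rle_encode(quality):
    """Collapse each run of identical quality characters into (char, length)."""
    return [(ch, len(list(run))) for ch, run in groupby(quality)]


def rle_decode(pairs):
    """Invert rle_encode by repeating each character."""
    return "".join(ch * n for ch, n in pairs)


if __name__ == "__main__":
    # Hypothetical read indices parsed from FASTQ identifier lines.
    print(delta_encode([100, 101, 102, 105]))
    # Hypothetical quality string.
    print(rle_encode("IIIIHH#"))
```

Both transforms are lossless. Delta encoding pays off when consecutive identifiers differ by small amounts, and run-length encoding pays off when the quality string contains long runs of the same character.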
Keywords/Search Tags: Single-molecule sequencing technology, Locality-sensitive hashing, Reference genome, Data compression, MinHash