High-throughput Genome Resequencing Data Compression Algorithm Based On Self-index Structure

Posted on:2019-04-06

Degree:Master

Type:Thesis

Country:China

Candidate:H J Rong

Full Text:PDF

GTID:2370330566498092

Subject:Computer Science and Technology

Abstract/Summary:


There has been growing interest in genome sequencing, driven by advances in sequencing technology. Although early sequencing technologies required several years to capture a 3 billion nucleotide genome, genomes as large as 22 billion nucleotides are now being sequenced within days using next-generation sequencing technologies. As sequencing speeds have increased, the cost of sequencing has plummeted. Genome sequencing plays an important role in personalized medicine and public health. More and more genomic sequencing data is constantly being generated, and these data need to be stored, transmitted and analyzed. How to resolve the conflict between rapidly growing data and limited storage space has become an important research topic. DNA data compression provides an effective way to address this problem. However, due to the characteristics of DNA data itself, traditional compression methods struggle to achieve a good compression ratio.

In view of the above issues, the first two chapters of this thesis survey the current status of high-throughput data compression and analyze the principles and challenges of the related compression algorithms. An improved high-throughput data compression algorithm is then proposed. The main contributions of this study are:

(1) The storage formats of high-throughput datasets and existing compression algorithms were studied, and the biological characteristics of the sequencing data were analyzed. The analysis also showed that lossy compression of quality scores can preserve (and sometimes even improve) performance in downstream analysis while improving compression performance.

(2) Building on the scheme of differential compression coding against a reference genome, a vertical coding method is adopted. In addition, a combination of sparsification and averaging is applied to the quality-score data to obtain better lossy compression performance.

(3) To meet the requirements of random decompression and fast retrieval, and based on an analysis of the principles of self-index compression, a self-indexing compression technique based on the PBWT data structure is proposed. Experiments show that introducing self-indexing improves random decompression performance.

Building on reference-genome-based compression, this thesis proposes a random decompression algorithm based on a self-index structure, which offers certain advantages in compression efficiency and can meet the requirements of local retrieval and decompression. This can relieve, to a certain extent, the storage and transmission pressure of massive high-throughput data, and provides experience and lessons for subsequent research.
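As a rough illustration of the reference-based differential coding mentioned in contribution (2), the Python sketch below stores an aligned read as its position in the reference plus the bases that differ. It is not the thesis's algorithm: the record layout, the function names `encode_read` and `decode_read`, and the restriction to substitutions only (no insertions or deletions) are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption for this sketch):
# (start position in reference, read length, [(offset in read, differing base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Encoded:
    """Store a read as its alignment position plus the bases that differ from the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only mismatching positions; substitutions only, indels are not handled.
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read by copying the reference window and re-applying the stored differences."""
    pos, length, diffs = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGC"
read = "TACCTT"  # aligned at position 3, with one substitution
encoded = encode_read(reference, read, 3)
assert decode_read(reference, encoded) == read
```

Because most reads match the reference almost everywhere, the list of differences is usually short, which is where a scheme of this kind gets its savings.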

Keywords/Search Tags:

DNA sequence compression, Reference-based compression, Self-index, Vertical encoding


Related items

1	Research On Fast Migration Algorithm Between Reference Gene Compression Libraries
2	Lossless Compression Of High-throughput DNA Sequence Data
3	The Research Of Reference-based Compression Specified For Sequence Data
4	Research Of Reference-based Genome Sequence Data Compression Algorithm
5	Compression Of DNA Sequences Based On Reference Sequences And Weighting Of Context Models
6	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array
7	Research On Cloud Platform Oriented Efficient Storage Compression Of Bioinformatics Data
8	Research On Compression And Assembly Of Biological Sequencing Data Based On Non-reference Genomes
9	The Design Of H.264 Video Compression And Encoding Based On SoC-FPGA
10	Optimization And Implementation Of Lossless Compression Of Gene Sequencing Data