
High-throughput Genome Resequencing Data Compression Algorithm Based On Self-index Structure

Posted on: 2019-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: H J Rong
Full Text: PDF
GTID: 2370330566498092
Subject: Computer Science and Technology
Abstract/Summary:
Interest in genome sequencing has grown steadily, driven by advances in sequencing technology. Early sequencing technologies required several years to sequence a genome of 3 billion nucleotides, whereas genomes as large as 22 billion nucleotides are now sequenced within days using next-generation sequencing technologies. As sequencing speeds have increased, the cost of sequencing has plummeted. Genome sequencing plays an important role in personalized medicine and public health. Ever more genomic sequencing data is being generated, and this data must be stored, transmitted, and analyzed. How to reconcile rapidly growing data volumes with limited storage space has therefore become an important research topic. DNA data compression offers an effective way to address this problem. However, owing to the characteristics of DNA data itself, traditional compression methods struggle to achieve good compression results.

In view of these issues, the first two chapters investigate the current state of high-throughput data compression and analyze the principles and challenges of the related compression algorithms. An improved high-throughput data compression algorithm is then proposed. The main contributions of this study are:

(1) The storage formats of high-throughput datasets and the existing compression algorithms are studied, and the biological characteristics of the sequencing data are analyzed. The analysis also shows that lossy compression of quality scores can maintain good (sometimes even better) performance in downstream analysis while improving compression performance.

(2) Building on differential compression coding based on reference genomes, a vertical coding method is adopted. In addition, a combination of sparsification and averaging is applied to the quality score data to obtain better lossy compression performance; the results indicate better compression.

(3) To meet the requirements of random decompression and fast retrieval, and based on an analysis of the principles of self-index compression, a self-indexing compression technique based on the PBWT data structure is proposed. Experiments show that introducing self-indexing yields better random decompression performance.

Building on a reference-genome-based compression algorithm, this thesis proposes a random decompression algorithm based on a self-index structure, which offers certain advantages in compression efficiency and supports local retrieval and decompression. This can relieve, to a certain extent, the storage and transmission pressure of massive high-throughput data, and provides experience and lessons for subsequent research.
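
To make the idea of reference-based differential coding concrete, the following is a minimal Python sketch, not the algorithm proposed in the thesis. It assumes each read has already been aligned to the reference at a known position and differs from it only by substitutions (no insertions or deletions). The function names `encode_read` and `decode_read` and the record layout are hypothetical, chosen for illustration only.

```python
from typing import List, Tuple

# (offset within the read, substituted base); a hypothetical record layout
Diff = Tuple[int, str]
Record = Tuple[int, int, List[Diff]]


def encode_read(reference: str, read: str, pos: int) -> Record:
    """Encode a read aligned at `pos` as (position, length, mismatches).

    Assumption: the read differs from the reference only by substitutions.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the positions where the read disagrees with the reference.
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_read(reference: str, record: Record) -> str:
    """Rebuild the read from the reference and its mismatch list."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    record = encode_read(reference, "GTTCG", 2)
    print(record)                          # (2, 5, [(2, 'T')])
    print(decode_read(reference, record))  # GTTCG
```

Because most reads match the reference closely, the mismatch list is usually short, which is what makes this representation compact before any entropy coding is applied.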
Keywords/Search Tags: DNA sequence compression, Reference-based compression, Self-index, Vertical Encoding