Research For Lossless Compression Algorithms Based On DNA Sequences

Posted on: 2019-10-22

Degree: Master

Type: Thesis

Country: China

Candidate: W J Fan

Full Text: PDF

GTID: 2370330590992331

Subject: Electronics and Communications Engineering

Abstract/Summary:

With the development of information technology, scientists record more and more data across many kinds of work and study. In bioinformatics, DNA, an important genetic material in living organisms, stores a large amount of genetic information and guides biological development and the functions of life. With progress in DNA sequencing technology and related sequencing work, the volume of DNA sequence data has been growing rapidly, at an exponential rate. How to store this rapidly expanding data effectively in limited storage space is a new problem facing computer scientists and biologists today. However, conventional data compression algorithms give poor results on DNA sequences and may even cause the stored data to expand. The distinctive properties of DNA sequences, such as sequence repeats, mirror repeats, complementary palindromes, and the highly repetitive sequences shared by similar species, make structural compression of DNA sequences possible. This thesis focuses on how to use more efficient compression methods to reduce data storage space.

For DNA sequence compression with a reference sequence, this thesis proposes an efficient compression method based on a full-text index. In the first stage of compression, the method uses the efficient index structure FM-index to find and locate the longest matching sequence in the reference sequence. Because the FM-index is usually used for fixed-length pattern matching, it is not well suited to matching and recording information for actual sequences. The improved FM-index can find and locate variable-length sequences within a limited time. To recover the input sequence losslessly at the decoding end, the thesis uses a complementary context model to calculate symbol occurrence probabilities under different context models, and combines sequential and non-sequential context models to calculate the prediction probability for arithmetic coding, achieving efficient lossless sequence compression. Experimental results show that this method outperforms other DNA compression algorithms in compression ratio when the data is not preprocessed.

For DNA sequence compression without a reference sequence, this thesis proposes a sequence prediction and compression model based on an auto-encoder. Convolutional layers are used to learn a feature representation of the data, and a sparse representation unit of the sequence is obtained from the encoder part of the auto-encoder. This representation unit is fed to the decoder to reconstruct the input sequence. To make the compression lossless, the residual between the reconstructed sequence and the input sequence is recorded and encoded as a second part of the compressed output. The thesis explores the possibility of using deep learning techniques to achieve lossless compression of sequences and shows that the network is able to learn implicit features of sequences. Experiments show that the reconstruction accuracy for DNA sequences is above 98%, and that the overall compression ratio is better than that of conventional compression methods.
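The residual step in the reference-free method can be illustrated with a minimal sketch. The abstract does not specify how the residual is represented, so the format used here (a list of position and base pairs) and the function names `encode_residual` and `apply_residual` are assumptions made for illustration only. The `reconstructed` string stands in for the output of a lossy model such as the auto-encoder.

```python
from typing import List, Tuple

# Hypothetical residual format: (position, correct base) for each mismatch.
Residual = List[Tuple[int, str]]


def encode_residual(original: str, reconstructed: str) -> Residual:
    """Record every position where the reconstruction differs from the input."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have equal length")
    return [
        (i, o)
        for i, (o, r) in enumerate(zip(original, reconstructed))
        if o != r
    ]


def apply_residual(reconstructed: str, residual: Residual) -> str:
    """Patch the lossy reconstruction so it matches the original exactly."""
    bases = list(reconstructed)
    for pos, base in residual:
        bases[pos] = base
    return "".join(bases)


original = "ACGTACGTAC"
reconstructed = "ACGTACCTAC"  # stand-in for a model output with one wrong base

residual = encode_residual(original, reconstructed)
restored = apply_residual(reconstructed, residual)
print(residual, restored == original)
```

The higher the reconstruction accuracy, the shorter the residual list, which is why a lossy predictor can still serve as the basis of a lossless scheme. In a full compressor the residual would itself be entropy coded; that step is omitted here.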

Keywords/Search Tags:

DNA lossless compression, FM-index, contextual model, autoencoder

Related items

1	A Lossless Compression Method Of Seismic Data
2	Research On Lossless Compression Algorithm For Time Series
3	Research On Lossless Compression Algorithms For FASTQ Files
4	Research On High Compression Ratio For The Data From The On-board Fourier Transform Spectrometer
5	Research On Lossless Compression Of High-throughput Genome Data
6	The Study And Implement Of Compression Algorithm For Lossless Base Logging Data
7	Research On Lossless Compression Of Sequential Images For Space Astronomical Observation
8	The Research And Implementation Of Data Compression In Vector Maps
9	Lossless Compression Of High-throughput DNA Sequence Data
10	Study On Lossless Compression Algorithm For Monitoring Data Of Complicated Engineering System