Research For Lossless Compression Algorithms Based On DNA Sequences

Posted on: 2019-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: W J Fan
Full Text: PDF
GTID: 2370330590992331
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the development of information technology, scientists record more and more data in the course of their work and studies. In bioinformatics, DNA, an important genetic material in living organisms, stores a large amount of genetic information and guides biological development and the operation of life functions. With progress in DNA sequencing technology and related sequencing efforts, the volume of DNA sequence data has also grown rapidly, at an exponential rate. How to store this rapidly expanding data effectively in limited storage space is a new problem facing computer scientists and biologists today. However, the results produced by conventional data compression algorithms are not ideal, and in some cases the output is even larger than the input. The unique properties of DNA sequences, such as sequence repeats, mirror repeats, complementary palindromes, and the highly repetitive sequences shared by similar species, make structural compression of DNA sequences possible. This thesis focuses on how to use more efficient compression methods to reduce data storage space.

For the compression of DNA sequences with a reference sequence, this thesis proposes an efficient method based on a full-text index. In the first stage of compression, the method uses the efficient index structure FM-index to find and locate the longest matching sequence in the reference sequence. Since the FM-index is usually used for fixed-length pattern matching, it is not well suited to matching and recording information for actual sequences. The improved FM-index can find and locate variable-length sequences within a limited time. To recover the input sequence losslessly at the decoding end, this thesis uses a complementary context model that calculates the probability of each symbol under different context models, and combines sequential and non-sequential context models to compute the prediction probability for arithmetic encoding, thereby achieving efficient lossless sequence compression. Experimental results show that this method outperforms other DNA compression algorithms in compression ratio when the data is not preprocessed.

For DNA sequence compression without a reference sequence, this thesis proposes a sequence prediction and compression model based on an auto-encoder. Convolutional layers are used to learn a feature representation of the data, and a sparse representation unit of the sequence is obtained from the encoder part of the auto-encoder. The resulting representation unit is fed to the decoder part to reconstruct the input sequence. To make the compression lossless, the residue between the reconstructed sequence and the input sequence is recorded and encoded as another part of the compressed output. This thesis explores the possibility of using deep learning techniques to achieve lossless compression of sequences, and shows that the network is able to learn implicit features of sequences. Experiments show that the reconstruction accuracy of the method is above 98%, and that the overall compression ratio is better than that of conventional compression methods.
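As a rough illustration of the first stage of the reference-based method, the sketch below finds longest matches of a target sequence in a reference using a textbook FM-index with backward search. It is not the improved FM-index of the thesis, whose details the abstract does not give. The function names (`build_fm_index`, `longest_match`, `encode`, `decode`), the `min_len` threshold, and the match/literal output format are all assumptions made for this example. The index is built over the reversed reference so that each backward-search step extends the match one symbol to the right in the target, which is one common way to obtain variable-length matches.

```python
def build_fm_index(text):
    """Build a small FM-index (suffix array, C table, Occ table).

    Naive construction by sorting suffixes; adequate for a sketch only.
    """
    text += "$"  # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # i == 0 wraps around to the sentinel
    alphabet = sorted(set(text))
    # C[c] = number of symbols in text that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][k] = number of occurrences of c in bwt[:k]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, C, occ


def longest_match(index, ref_len, target, start):
    """Longest prefix of target[start:] that occurs in the reference.

    `index` must be built over the REVERSED reference. Returns
    (position in the forward reference, length), or (None, 0).
    """
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open suffix-array interval for the empty pattern
    length = 0
    while start + length < len(target):
        c = target[start + length]
        if c == "$" or c not in C:
            break
        new_lo = C[c] + occ[c][lo]
        new_hi = C[c] + occ[c][hi]
        if new_lo >= new_hi:  # extending by c would leave no occurrence
            break
        lo, hi = new_lo, new_hi
        length += 1
    if length == 0:
        return None, 0
    # sa[lo] is where the reversed match starts in the reversed reference
    return ref_len - sa[lo] - length, length


def encode(reference, target, min_len=4):
    """Greedy factorisation of target into matches and literal symbols."""
    index = build_fm_index(reference[::-1])
    ops, i = [], 0
    while i < len(target):
        pos, length = longest_match(index, len(reference), target, i)
        if length >= min_len:
            ops.append(("M", pos, length))  # copy from the reference
            i += length
        else:
            ops.append(("L", target[i]))  # emit a single literal symbol
            i += 1
    return ops


def decode(reference, ops):
    """Rebuild the target from the reference and the operation list."""
    out = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            out.append(reference[pos:pos + length])
        else:
            out.append(op[1])
    return "".join(out)


reference = "ACGTACGTTAGCCGATAGGCTA"
target = "ACGTTAGCCGTTAGGCTAACN"
ops = encode(reference, target)
assert decode(reference, ops) == target  # the round trip is lossless
```

In a full compressor the match positions, lengths, and literals would still need to be entropy coded, for example with the context-model-driven arithmetic coder the abstract describes. That stage is omitted here.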
Keywords/Search Tags: DNA lossless compression, FM-index, contextual model, autoencoder