Research on DNA Sequence Compression Algorithms Based on Statistical Theory

Posted on: 2016-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: X K Tong
Full Text: PDF
GTID: 2180330479493856
Subject: Signal and Information Processing
Abstract/Summary:
DNA stores the genetic information of life and is the material foundation for the survival, development, and evolution of living organisms, so the study of DNA sequences has great social and scientific significance. DNA sequence data sets are very large, and the demand for exchanging this data keeps growing. Effective storage and transfer therefore require DNA sequence compression technology. In recent years, many compression algorithms tailored to the particular properties of DNA sequences have emerged, and the field has made steady progress.

Current DNA sequence compression algorithms fall into two types: substitution-based algorithms and statistics-based algorithms. This thesis makes improvements based on the principles and features of each type and proposes two new algorithms.

First, for statistics-based compression, the thesis proposes a DNA sequence compression algorithm based on mixed experts. The mixed experts extend the experts of the XM algorithm (the state-of-the-art statistics-based algorithm) and make better use of data features when estimating the probability distribution of sequence symbols. The estimated probability distributions are then passed to an arithmetic coder, which produces the compressed output. Compared with the XM algorithm, the proposed algorithm achieves better compression.

Second, the thesis presents a DNA sequence compression algorithm based on iterative dictionary construction. Each iteration builds a dictionary, selects the most frequent sequence segment of suitable length, and replaces that segment with a predefined non-terminal character. The substituted sequence becomes the input to the next iteration, and the process repeats until the termination condition is met. The final sequence is the compression result. Experimental results show that this approach improves compression performance.

The thesis also proposes a preprocessing step, based on an improved LZ compression algorithm, that runs before the main iterative dictionary construction. The improved LZ algorithm must satisfy two preprocessing requirements: its output must still consist of the four base symbols, and it must shorten the DNA sequence to some extent. With this preprocessing, the final compression is better than with iterative dictionary construction alone.
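
The iterative dictionary construction described in the abstract can be illustrated with a short sketch. This is not the thesis's implementation: the abstract does not say how the "suitable length" is chosen, which substitute characters are used, or when iteration stops, so the fixed segment length `seg_len`, the symbol pool `SUBSTITUTES`, and the `min_count` stopping rule below are all assumptions made for illustration. The LZ-based preprocessing step is not shown.

```python
from collections import Counter

# Hypothetical pool of substitute symbols; any characters outside A/C/G/T
# would do. The thesis does not specify which symbols it uses.
SUBSTITUTES = "0123456789"


def compress(seq, seg_len=4, min_count=2):
    """Iteratively replace the most frequent segment with a fresh symbol.

    seg_len and min_count are illustrative assumptions, not thesis values.
    Returns the substituted sequence and the list of (symbol, segment) rules.
    """
    rules = []  # rules in the order they were created
    for symbol in SUBSTITUTES:
        # Build this iteration's dictionary: count every window of seg_len
        # symbols (overlapping windows are counted too).
        counts = Counter(
            seq[i:i + seg_len] for i in range(len(seq) - seg_len + 1)
        )
        if not counts:
            break  # sequence is shorter than one segment
        segment, count = counts.most_common(1)[0]
        if count < min_count:
            break  # no segment repeats, so another rule would not help
        # Substitute the segment; the result feeds the next iteration.
        seq = seq.replace(segment, symbol)
        rules.append((symbol, segment))
    return seq, rules


def decompress(seq, rules):
    """Undo the substitutions, newest rule first."""
    for symbol, segment in reversed(rules):
        seq = seq.replace(symbol, segment)
    return seq
```

For example, `packed, rules = compress("ACGTACGTACGTTTGACGTACGT")` followed by `decompress(packed, rules)` returns the original sequence. Rules are undone in reverse order because a later segment may itself contain symbols introduced by earlier iterations. A real compressor would also need to encode the rule list and the substituted sequence compactly, which this sketch leaves out.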
Keywords/Search Tags: DNA sequence compression, mixed experts, iterative dictionary construction, improved LZ algorithm