Research On DNA Sequences Compression Algorithm Based On Statistical Theory

Posted on:2016-03-18

Degree:Master

Type:Thesis

Country:China

Candidate:X K Tong

GTID:2180330479493856

Subject:Signal and Information Processing

Abstract/Summary:

DNA, which stores the genetic information of life, is the material foundation for the survival, development and evolution of living organisms, and the study of DNA sequences has great social and scientific significance. DNA sequence data sets are large, and the demand for exchanging them keeps growing. Storing and transferring these data effectively requires DNA sequence compression technology. In recent years, many compression algorithms tailored to the particular properties of DNA sequences have emerged, and some progress has been made in this area.

Currently there are two types of DNA sequence compression algorithms: substitution-based algorithms and statistical-information-based algorithms. In this thesis, improvements are made according to the principles and features of each type, and two new algorithms are proposed.

First, a DNA sequence compression algorithm based on mixed experts is proposed for statistical-information-based compression. The mixed experts extend the experts of the XM algorithm (the state-of-the-art statistical-information-based algorithm) and make better use of data features when estimating the probability distribution of sequence symbols. The estimated probability distributions then drive arithmetic coding, which produces the compressed output. Compared with the XM algorithm, the presented algorithm achieves a better compression effect.

Second, a DNA sequence compression algorithm based on iterative dictionary construction is discussed. In each iteration, a dictionary is created to select the highest-frequency sequence segment of suitable length, and a predefined non-terminal character is substituted for that segment. After substitution, the sequence is passed to the next iteration, and this repeats until the iteration terminates. The final sequence is the compression result. The results show a further improvement in compression performance.

Before the main iterative dictionary construction step, the thesis also proposes a preprocessing step based on an improved LZ compression algorithm. To serve as preprocessing, the improved LZ algorithm must produce output that still consists of the four DNA symbols while reducing the length of the sequence to some extent. The final compression effect is better than that of iterative dictionary construction alone.
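As a rough illustration of the statistical approach described above, the following Python sketch blends several context-model experts by Bayesian averaging and reports the ideal arithmetic-coding cost in bits. It is not the thesis's mixed-expert design and not the XM implementation: the order-k experts, the add-one smoothing, the uniform prior over experts and every name (ContextExpert, estimate_code_length) are assumptions chosen for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextExpert:
    """Hypothetical expert: an order-k context model with add-one smoothing."""

    def __init__(self, order):
        self.order = order
        # Each context starts with a count of 1 per symbol (assumed smoothing).
        self.counts = defaultdict(lambda: [1, 1, 1, 1])

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def estimate_code_length(seq, orders=(1, 2, 4)):
    """Return the ideal code length in bits when the mixture drives an arithmetic coder."""
    experts = [ContextExpert(k) for k in orders]
    weights = [1.0 / len(experts)] * len(experts)  # assumed uniform prior
    max_order = max(orders)
    bits = 0.0
    for i, symbol in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        preds = [e.predict(history) for e in experts]
        # Mixture probability of the symbol that actually occurred.
        p = sum(w * pr[s] for w, pr in zip(weights, preds))
        bits += -math.log2(p)
        # Bayesian update: experts that predicted well gain weight.
        weights = [w * pr[s] for w, pr in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        for e in experts:
            e.update(history, symbol)
    return bits
```

Calling estimate_code_length("ACGT" * 50) gives a cost well below the 2 bits per base of a fixed-length code, because the experts quickly learn the repeating pattern. With this kind of weighting, the total cost exceeds that of the best single expert by at most log2 of the number of experts.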
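The iterative dictionary construction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis's algorithm: the abstract does not say how a "suitable length" is chosen or which non-terminal characters are predefined, so the sketch assumes segment lengths between min_len and max_len, picks the segment with the highest non-overlapping frequency (ties broken toward the longer segment), and uses the digits 0 to 9 as hypothetical non-terminal characters. The input is assumed to contain only A, C, G and T.

```python
def best_segment(seq, min_len, max_len):
    """Return the most frequent repeated segment, or None if nothing repeats."""
    candidates = {
        seq[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(seq) - n + 1)
    }
    best, best_key = None, (0, 0)
    for seg in sorted(candidates):  # sorted for deterministic tie-breaking
        key = (seq.count(seg), len(seg))  # non-overlapping count, then length
        if key[0] >= 2 and key > best_key:
            best, best_key = seg, key
    return best


def compress(seq, min_len=2, max_len=8, symbols="0123456789"):
    """Replace one high-frequency segment per iteration with a fresh symbol."""
    dictionary = {}
    for sym in symbols:
        seg = best_segment(seq, min_len, max_len)
        if seg is None:  # terminate when no segment repeats
            break
        dictionary[sym] = seg
        seq = seq.replace(seg, sym)
    return seq, dictionary


def decompress(seq, dictionary):
    """Undo the substitutions in reverse order of creation."""
    for sym, seg in reversed(list(dictionary.items())):
        seq = seq.replace(sym, seg)
    return seq
```

A round trip such as decompress(*compress("ACGTACGTACGT")) returns the original sequence. A practical coder would also have to account for the cost of storing the dictionary, which this sketch ignores.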

Keywords/Search Tags:

DNA sequence compression, mixed experts, iterative dictionary construction, improved LZ algorithm

Related items

1	The Design & Implementation Of Compression Software For Well Logging Results
2	Convergence And Iterative Algorithms For Systems Of Generalized Mixed Equilibrium Problems
3	Comparison Of The Effect Of Low Discrepancy Sequences Using Monte Carlo Method To Price American Options
4	Research And Application Of Seismic Data Denoising Based On Online Dictionary Learning Algorithm
5	Iterative Approximations Of Fixed Points For Several Classes Of Nonlinear Mappings
6	Compression Of DNA Sequences Based On Reference Sequences And Weighting Of Context Models
7	Iterative Algorithm For Solving Variational Inclusions And Equilibrium Problems
8	Research And Application Of Comprehensive Evaluation Index System Construction Under The Constraints Of Expert Consultation
9	Iterative Algorithms For Solving A Split Equality Fixed Point Problem
10	Existence And Iterative Algorithms For Mixed Quasi-variational-like Inequalities