Compression Of DNA Sequences Based On Reference Sequences And Weighting Of Context Models

Posted on:2018-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:R S Wang

Full Text:PDF

GTID:2370330518954924

Subject:Communication and Information System

Abstract/Summary:

As our understanding of the characteristics of DNA sequences deepens, the need for efficient compression becomes pressing. Exploiting the high similarity between DNA sequences of homologous species, GReEn uses a reference sequence to build probabilistic copy models and an arithmetic encoder to encode the DNA, and it achieves strong compression performance. However, its performance declines sharply where the target sequence differs from the reference sequence. This thesis uses weighted context models to address this problem. First, we build a hash table that uses linked lists to store each k-mer of the reference sequence, process the target sequence in the same way, and compare it with the reference sequence. We then use the weighted context models to encode the positions where the target differs from the reference. Based on the theory of Minh D. C. that the weights of context models are positively related to the reciprocal log of the description length, we propose multi-group weighted context models to reduce the code length. We sort and count the description lengths of each model, compute the logarithm and derivation of these statistics, and finally update the weights according to the statistical characteristics of the description length. The experimental results show that weighting the context models improves compression efficiency when the target sequence differs from the reference sequence, and that combining a reference sequence with weighted context models improves the efficiency of DNA compression.
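
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function and class names (`build_kmer_index`, `split_target`, `ContextModel`, `mixture_code_length`), the model orders, and the forgetting factor `gamma` are all hypothetical. In particular, the weight update shown here is the common rule w_i ∝ w_i^gamma · p_i(symbol), in which a model's weight shrinks as its accumulated description length grows; the abstract's own statistics-based update rule is not specified in enough detail to reproduce. The sketch reports an ideal code length in bits instead of driving a real arithmetic encoder.

```python
from collections import defaultdict
import math

ALPHABET = "ACGT"

def build_kmer_index(reference, k):
    """Hash table mapping each k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def split_target(reference, target, k):
    """Greedily parse the target into ('copy', ref_pos, length) segments that
    match the reference and ('literal', text) segments that do not."""
    index = build_kmer_index(reference, k)
    segments, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for pos in index.get(target[i:i + k], ()):
            length = 0
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:
                segments.append(("literal", "".join(literal)))
                literal = []
            segments.append(("copy", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        segments.append(("literal", "".join(literal)))
    return segments

class ContextModel:
    """Order-k context model with Laplace-smoothed symbol counts."""
    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(lambda: [1, 1, 1, 1])

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1

def mixture_code_length(sequence, orders=(1, 2, 4), gamma=0.9):
    """Ideal code length (bits) of a weighted mixture of context models.
    Assumed update rule: w_i <- w_i**gamma * p_i(symbol), then normalise."""
    models = [ContextModel(o) for o in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        preds = [m.predict(history) for m in models]
        p = sum(w * pr[s] for w, pr in zip(weights, preds))
        total_bits += -math.log2(p)
        weights = [(w ** gamma) * pr[s] for w, pr in zip(weights, preds)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits, weights
```

For example, with the reference `ACGTACGTTTGA`, the target `ACGTACCCTTGA`, and `k = 4`, `split_target` yields a copy of 6 bases from reference position 0, the literal `CC`, and a copy of 4 bases from position 8. Only the literal part would then be passed to `mixture_code_length`, which mirrors the abstract's idea of spending the context-model coding effort on the regions that differ from the reference.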

Keywords/Search Tags:

DNA compression, context weighting, reference sequence, description length, arithmetic encoder

Related items

1	Research On Coding Of Genome Sequences Based On Context Weighting
2	The Optimized Context Modeling And Its Application On Microbial Genome Sequence Compression And Image Compression
3	Research On Fast Migration Algorithm Between Reference Gene Compression Libraries
4	Lossless Compression Of High-throughput DNA Sequence Data
5	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array
6	Research Of Reference-based Genome Sequence Data Compression Algorithm
7	High-throughput Genome Resequencing Data Compression Algorithm Based On Self-index Structure
8	The Research Of Reference-based Compression Specified For Sequence Data
9	Research On Compression And Assembly Of Biological Sequencing Data Based On Non-reference Genomes
10	The Research On DNA Sequence Compression With Transform Coding And Entropy Coding