Research On Coding Of Genome Sequences Based On Context Weighting

Posted on:2019-07-07

Degree:Master

Type:Thesis

Country:China

Candidate:L L Xu

Full Text:PDF

GTID:2370330548974394

Subject:Electronics and Communications Engineering

Abstract/Summary:


As high-efficiency compression algorithms for genomic sequences continue to emerge, methods that exploit the statistical properties and repetitive characteristics within a sequence to compress biological sequences are constantly being optimized. One such approach targets the high degree of similarity between the DNA sequences of homologous species: a context weighting model is constructed from the target sequence, and the resulting probability distribution is fed to an arithmetic encoder to compress the DNA sequence, with very significant effect. Previous studies have concentrated on how to optimize the weights, but none has examined whether the probability distribution at each moment should take part in the weighting at all. To address this gap, this thesis designs a selectable context weighting model that judges the similarity of the probability distributions from the description length increment and then decides whether to weight them.

First, the processed target sequence is stored so that it can be retrieved at encoding time. To exploit the correlation between characters, a weighted combination of a group of context models is used to reduce the code length; equal weights are used here. Next, the description length of the probability distribution in each model is calculated, and the description length increment is compared with a threshold to determine whether the distributions are similar. If they are similar, the weighted distribution is used for encoding. If they are not, the distribution with the smallest information entropy is selected for encoding. The total code length is then obtained. In addition, the code lengths obtained for different threshold values under different conditions are tabulated and analyzed.

The experimental results show that using the description length to judge whether the probability distributions are similar, and then choosing whether to apply context weighting, improves the compression efficiency of the target sequence. The code length is effectively reduced, and lossless compression improves by six-thousandths (0.6%) over compression under one model. This also shows that the method can improve compression efficiency in gene sequence compression.
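The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several details are assumptions made for illustration only: the function names (entropy, select_distribution, code_length), the context orders (1, 2, 3), the threshold value 0.05, the add-one initialization of the counts, and above all the definition of the description length increment, which is approximated here as the entropy of the equal-weight mixture minus the mean entropy of the individual distributions. The ideal code length (the sum of -log2 p) stands in for the output of an arithmetic encoder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def select_distribution(dists, threshold):
    """Equal-weight mixture if the distributions look similar,
    otherwise the single distribution with the smallest entropy.

    ASSUMPTION: the "description length increment" is approximated as
    H(mixture) - mean(H(p_i)); the thesis may define it differently.
    """
    k = len(dists)
    mix = [sum(d[i] for d in dists) / k for i in range(len(dists[0]))]
    increment = entropy(mix) - sum(entropy(d) for d in dists) / k
    if increment <= threshold:
        return mix
    return min(dists, key=entropy)

def code_length(seq, orders=(1, 2, 3), threshold=0.05):
    """Ideal adaptive code length in bits of seq over ALPHABET.

    ASSUMPTION: one count table per context order, initialized with
    add-one counts so that no symbol ever has probability zero.
    """
    counts = {k: defaultdict(lambda: [1] * len(ALPHABET)) for k in orders}
    total = 0.0
    for t, sym in enumerate(seq):
        s = ALPHABET.index(sym)
        # Context of order k = the k preceding symbols (shorter at the start).
        ctxs = {k: seq[max(0, t - k):t] for k in orders}
        dists = []
        for k in orders:
            c = counts[k][ctxs[k]]
            n = sum(c)
            dists.append([x / n for x in c])
        p = select_distribution(dists, threshold)
        total += -math.log2(p[s])
        # Update every model with the symbol that was just coded.
        for k in orders:
            counts[k][ctxs[k]][s] += 1
    return total
```

Calling code_length on the same sequence with several threshold values gives a rough analogue of the threshold analysis mentioned in the abstract: a very large threshold always uses the equal-weight mixture, while a negative threshold always picks the minimum-entropy model.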

Keywords/Search Tags:

DNA compression, Context weighted, Target sequence, Description length


Related items

1	Compression Of DNA Sequences Based On Reference Sequences And Weighting Of Context Models
2	The Optimized Context Modeling And Its Application On Microbial Genome Sequence Compression And Image Compression
3	The Study On The Fine Description Of Reservoir And The Predicting Of Advantageous Exploration Target In Weibei Depression
4	Research On Cloud Platform Oriented Efficient Storage Compression Of Bioinformatics Data
5	The Relationship Between The Amino Acid Residues Of Context And The Secondary Structure Of Protein Of Target Sequence
6	Lossless Compression Of High-throughput DNA Sequence Data
7	Research On Fast Migration Algorithm Between Reference Gene Compression Libraries
8	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array
9	Based On Class Background And The Background Of The Conditions Of Epitaxial Concept Lattice Compression
10	Duquenne-Guigues Bases Of Lattice-valued Fuzzy Description Logic L-ALCN