
Research On Coding Of Genome Sequences Based On Context Weighting

Posted on: 2019-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: L L Xu
GTID: 2370330548974394
Subject: Electronics and Communications Engineering
Abstract/Summary:
As high-efficiency compression algorithms for genomic sequences continue to emerge, methods that exploit the statistical properties and repetitive structure within a sequence are constantly being refined. Among them, one approach takes advantage of the high similarity between DNA sequences of homologous species: a Context weighted model is constructed from the target sequence, and the resulting probability distribution is fed to an arithmetic encoder to compress the DNA sequence, with very significant effect. Previous studies have concentrated on how to optimize the weights; none has examined whether the probability distribution at each moment should take part in the weighting at all. To address this gap, this thesis designs a selectable Context weighted model that judges the similarity of the probability distributions from the description length increment and then decides whether to weight.

First, the processed target sequence is stored so that it can be retrieved at encoding time. Because the characters in the sequence are correlated, a weighted combination of a group of Context models is used to reduce the code length effectively; the models are combined with equal weights. Next, the description length of the probability distribution in each model is calculated, and the relationship between the description length increment and a threshold is used to determine whether the probability distributions are similar. If they are similar, the weighted distribution is used for encoding. If they are not, the probability distribution with the smallest information entropy is selected for encoding. The total code length is obtained at the end. In addition, the code lengths obtained for different threshold values under different conditions are collected and analyzed.

The experimental results show that using the description length to judge whether the probability distributions are similar, and then choosing whether to apply Context weighting, improves the compression efficiency of the target sequence; that is, the code length is effectively reduced. Lossless compression improves by about six-thousandths relative to compression under a single model. The results also show that this method can improve compression efficiency in gene sequence compression.
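The selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several details are assumptions made only for illustration: the Context models are simple order-k count models with add-one smoothing, the orders (1, 2, 3) and the threshold value are arbitrary, and the "description length increment" is taken to be the expected extra bits incurred by coding with the equal-weight mixture instead of each model's own distribution (the entropy of the mixture minus the mean entropy of the individual distributions). The abstract does not give the exact definition, so this quantity is a stand-in. The ideal arithmetic-coding cost of a symbol is approximated as the negative base-2 logarithm of its probability.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k context model with add-one smoothed counts (illustrative only)."""

    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(lambda: [1] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is handled apart.
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c)
        return [x / total for x in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def selective_code_length(seq, orders=(1, 2, 3), threshold=0.1):
    """Ideal code length in bits under selective equal-weight mixing.

    `threshold` and the increment definition below are assumptions,
    not values or formulas taken from the thesis.
    """
    models = [ContextModel(k) for k in orders]
    max_order = max(orders)
    bits = 0.0
    for i, sym in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        dists = [m.predict(history) for m in models]
        # Equal-weight mixture of the model distributions.
        mix = [sum(col) / len(dists) for col in zip(*dists)]
        # Assumed increment: extra expected bits from using the mixture.
        increment = entropy(mix) - sum(entropy(d) for d in dists) / len(dists)
        if increment <= threshold:
            chosen = mix                      # distributions similar: weight
        else:
            chosen = min(dists, key=entropy)  # dissimilar: smallest entropy
        bits += -math.log2(chosen[ALPHABET.index(sym)])
        for m in models:
            m.update(history, sym)
    return bits


if __name__ == "__main__":
    sequence = "ACGT" * 50
    print(selective_code_length(sequence), "bits for", len(sequence), "bases")
```

Because the choice between the mixture and the single distribution depends only on symbols already coded, a decoder could repeat the same decision without side information. Sweeping `threshold` and comparing the returned code lengths mirrors, in a simplified way, the threshold analysis mentioned in the abstract.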
Keywords/Search Tags: DNA compression, Context weighted, Target sequence, Description length