Font Size: a A A

A Lossless Compression Algorithm For Genomic Data Based On Context Tree

Posted on:2022-07-08Degree:MasterType:Thesis
Country:ChinaCandidate:L C KongFull Text:PDF
GTID:2504306335996709Subject:Computer Software and Application of Computer
Abstract/Summary:PDF Full Text Request
With the rapid development of genome sequencing technology,the amount of genome data is increasing exponentially.How to efficiently compress and store each genome sequence has become one of the important research issues.Among them,the statistical compression algorithm based on entropy coding is an indispensable part in the field of genome sequence compression,and the PPM(Prediction by partial matching)algorithm based on Context tree model has gradually become the main choice of lossless compression in the mainstream genome compression algorithm.Since all symbols pass through the escape mechanism,the PPM algorithm always encodes the current symbol with the probability distribution of the highest order node,but the counting statistics of the highest-order probability distribution may not be sufficient,and there is still a certain degree of "Context dilution" problem.In order to solve these shortcomings,this paper uses the increment of description length as the criterion to combine the conditional probability distributions,so as to provide a more reasonable probability distribution estimation for the source sequence.First,we divide the genome into two parts.One part is called the effective symbol,which uses high-frequency base symbols to construct context tree model.The other part is called invalid symbol,which only needs to record its symbol,length,starting position and other information.In the process of constructing the Context tree model,since the count statistics information of the parent node is always more sufficient than that of the child nodes.It is possible to obtain better coding effect by adopting the count distribution encoding of the parent node after the escape mechanism of PPM algorithm.Secondly,according to the similarity between the child node and the parent node,the description length increment is used to judge whether to choose the probability distribution of the parent node.And set the threshold for counting scaling,better response and tracking the change characteristics of local statistics,so that the Context tree model can get better symbol adaptability with the change of the input stream.The experimental results show that it has better performance to use the increment of description length to judge the substitutability between parent and child nodes.Compared with the traditional PPM algorithm,the proposed algorithm in this paper can get better compression efficiency.No matter for the lossless compression process of large human genome sequence or small microbial genome sequence,the method in this paper can better reduce the storage space of genome sequences.
Keywords/Search Tags:DNA lossless compression, Context model, PPM algorithm, Description length
PDF Full Text Request
Related items