A Lossless Compression Algorithm For Genomic Data Based On Context Tree

Posted on:2022-07-08

Degree:Master

Type:Thesis

Country:China

Candidate:L C Kong

Full Text:PDF

GTID:2504306335996709

Subject:Computer Software and Application of Computer

Abstract/Summary:

PDF Full Text Request

With the rapid development of genome sequencing technology,the amount of genome data is increasing exponentially.How to efficiently compress and store each genome sequence has become one of the important research issues.Among them,the statistical compression algorithm based on entropy coding is an indispensable part in the field of genome sequence compression,and the PPM（Prediction by partial matching）algorithm based on Context tree model has gradually become the main choice of lossless compression in the mainstream genome compression algorithm.Since all symbols pass through the escape mechanism,the PPM algorithm always encodes the current symbol with the probability distribution of the highest order node,but the counting statistics of the highest-order probability distribution may not be sufficient,and there is still a certain degree of "Context dilution" problem.In order to solve these shortcomings,this paper uses the increment of description length as the criterion to combine the conditional probability distributions,so as to provide a more reasonable probability distribution estimation for the source sequence.First,we divide the genome into two parts.One part is called the effective symbol,which uses high-frequency base symbols to construct context tree model.The other part is called invalid symbol,which only needs to record its symbol,length,starting position and other information.In the process of constructing the Context tree model,since the count statistics information of the parent node is always more sufficient than that of the child nodes.It is possible to obtain better coding effect by adopting the count distribution encoding of the parent node after the escape mechanism of PPM algorithm.Secondly,according to the similarity between the child node and the parent node,the description length increment is used to judge whether to choose the probability distribution of the parent node.And set the threshold for counting scaling,better response and tracking the change characteristics of local statistics,so that the Context tree model can get better symbol adaptability with the change of the input stream.The experimental results show that it has better performance to use the increment of description length to judge the substitutability between parent and child nodes.Compared with the traditional PPM algorithm,the proposed algorithm in this paper can get better compression efficiency.No matter for the lossless compression process of large human genome sequence or small microbial genome sequence,the method in this paper can better reduce the storage space of genome sequences.

Keywords/Search Tags:

DNA lossless compression, Context model, PPM algorithm, Description length

PDF Full Text Request

Related items

1	Biomedical Signal Lossless / Near Lossless Compression Techniques
2	Ecg And Mri Data Lossless Compression Algorithm Of The Preliminary Study
3	Lossless Medical Image Compression Based On Region Segmentation And Regression Model
4	Near-lossless ECG Data Compression Method For Mobile Tele-cardio-monitoring
5	Application Research Of Image Compression Algorithm Based On Wavelet Transform In Telemedicine
6	Data Compression Algorithm And DSP Realization Of Medical Signal Based On Lifting Wavelet
7	Distribution And Prediction Of Patients’ Length Of Stay
8	Remote Ecg Monitoring System Of High-risk Heart Patients Lower Machine Software Design And Ecg Compression Algorithm
9	Compression And Streaming Of Large-Scale ECG Data For Teledata Sharing
10	Deep Learning-based Invertible Coil Compression For MRI