
Research Of Reference-based Genome Sequence Data Compression Algorithm

Posted on: 2020-11-28 | Degree: Master | Type: Thesis
Country: China | Candidate: W Shi | Full Text: PDF
GTID: 2370330575489312 | Subject: Biomedical engineering
Abstract/Summary:
With the development and growing adoption of next-generation sequencing (NGS) technologies, genome sequencing has become faster and cheaper, producing a massive amount of genome sequence data that continues to grow at an explosive rate. The time and cost of transmitting, storing, processing and analyzing these data have become bottlenecks that hinder the development of genetics and biomedicine. Although many general-purpose data compression algorithms exist, they are not effective for genome sequences because they do not exploit the inherent characteristics of genome sequence data. The development of a fast and efficient compression algorithm specific to genome sequence data is therefore an important and pressing issue.

In this thesis, a reference-based lossless genome sequence compression algorithm is proposed that achieves a high compression ratio and performs better than previous algorithms. Exploiting the high similarity between genomes of the same species, the target genome sequence to be compressed is matched against a reference genome sequence. The target sequence is then replaced by the matching result, that is, the position and length of each sub-sequence shared by the two sequences, together with the differences between them. A carefully designed mechanism for selecting the matching strategy combines the advantages of local matching and global matching to describe the matched sub-strings more efficiently. Different matching strategies are adopted depending on the degree of similarity between the reference and the target genome sequence, and a hash method is used to search for identical short strings in the two sequences. The effects of both the length and the position of matched sub-strings on compression efficiency are taken into account jointly. The various characters that occur in genome sequence data are handled effectively, which improves matching efficiency. Finally, the intermediate file that stores the matching result is compressed by an efficient entropy coding compressor.

The experimental results show that the proposed algorithm can compress human genome sequence data in FASTA format, about 3 GB in size, in at most 18 minutes. The compressed size of the 56 sets of human genome sequence test data ranges from 4.45 MB to 40.67 MB. The average compression ratio of the proposed algorithm is better than that of existing genome sequence compression algorithms of the same type, and the algorithm is more robust. Moreover, its time and space complexity are of the same order of magnitude as those of the most advanced algorithms, so it has strong practical value. This thesis also designs a corresponding efficient decompression algorithm, which can quickly and losslessly restore the target genome sequence data from the compressed file and the reference genome sequence data. Decompressing the complete human genome data takes no more than 2 minutes.
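The general idea of reference-based matching described above can be illustrated with a minimal sketch. This is not the thesis's algorithm: it assumes a single greedy strategy (no selection between local and global matching), a plain dictionary as the hash index, and a hypothetical seed length `k`. The names `build_index`, `encode`, `decode`, `pack` and `unpack` are invented for illustration, and `zlib` (DEFLATE) is only a stand-in for the unspecified entropy coding compressor.

```python
import json
import zlib
from collections import defaultdict


def build_index(reference, k):
    """Hash every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def encode(target, reference, k=4):
    """Greedily describe the target as matches and literals.

    ("M", ref_pos, length) copies `length` characters from the reference;
    ("L", text) stores characters that had no match of at least k.
    """
    index = build_index(reference, k)
    ops, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Look up the seed k-mer, then extend each candidate as far as possible.
        for pos in index.get(target[i:i + k], ()):
            length = k
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched characters first
                ops.append(("L", "".join(literal)))
                literal = []
            ops.append(("M", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decode(ops, reference):
    """Rebuild the target from the operations and the same reference."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


def pack(ops):
    """Serialize the operations and compress them (zlib as a stand-in)."""
    return zlib.compress(json.dumps(ops).encode("ascii"))


def unpack(blob):
    """Inverse of pack; operations come back as lists instead of tuples."""
    return json.loads(zlib.decompress(blob).decode("ascii"))
```

Under these assumptions, `decode(unpack(pack(encode(target, reference))), reference)` returns the original target, which mirrors the lossless round trip the abstract describes: only the match descriptions and the unmatched characters need to be stored, because the decoder has access to the same reference.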
Keywords/Search Tags: Next-generation sequencing, Genome, Lossless compression, Sequence matching