Font Size: a A A

Compression Algorithms On Highly Similar Genome Collections

Posted on:2019-08-15Degree:MasterType:Thesis
Country:ChinaCandidate:X GuoFull Text:PDF
GTID:2428330572951751Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the development of high-throughput sequencing technology,the cost required for sequencing has dropped drastically,and the volume of obtained sequencing data has risen sharply.Therefore,effectively storing and searching these genome data has become an urgent problem to solve.Compression technology can effectively reduce the storage space,excellent compression methods have become a key technology to solve this problem.The study shown that the structure of gene sequences is different from that of text data.The sequence contains more redundant information,and the sequence similarity between the same species or the more similar species classified according to the taxonomy is high,and the general compression method cannot effectively compress and search genomic sequence sets.So,it is necessary to propose a compression method for highly similar genome collections.In this thesis,we present a novel reference-based compression method named FMQ,which can efficiently store and search genome collections.The main idea is that we choose a sequence as the reference sequence,and then compute the variations between the reference sequence and the others.Then we propose several compressed data structures to store these variations.Meanwhile,we also provide efficient query algorithm,including extract query and locate query.The specific work is as follows:Compressed data structures.Firstly,we randomly select a sequence as the reference sequence,and obtain the distinct segments between the reference sequence and the others.We use the longest common subsequence algorithm compute the common parts between these distinct segments.The variations between sequences is obtained based on the common part.Finally,we propose different compressed data structures to store the variations based on their characteristics.These compressed index structures occupy o(n)bits,where n is the length of the reference sequence.Query on the compressed index structure.Based On the compressed index structure,we propose extract and locate query algorithms,respectively.The extract query algorithm can extract a substring in O(n/log n+len+log~2n/loglog n)time,where len is the length of the substring.The locate query divides the pattern into a set of sub-patterns,and obtain the positions of these sub-patterns using the compressed data structure.Then locate query filters out all the positions of the pattern through a hash table.The time complexity of locate query algorithm is O(m(9)occ(9)n/log n),where m is the number of sequences and occ is the sum of the number of occurrences of all sub-patterns in the reference sequence.Experimental results.We test the index construction time,compression ratio and search performance of FMQ on the Pizza&Chili Corpus data set.The results show that FMQ performs better than several the-state-of-art compression algorithms on index construction time and compression ratio;Its index construction time is the shortest,and the index size only accounts for 10%-60%of that of the compared methods.In terms of query performance,FMQ's locate query performs better.At the same time,the extract query has similar performance compared with the tested methods.
Keywords/Search Tags:genome collections, reference sequence, compressed index, locate query, extract query
PDF Full Text Request
Related items