Font Size: a A A

Processing And Optimization Techniques For Online Searching Subsequences On Large Compressed Sequences

Posted on:2011-04-15Degree:MasterType:Thesis
Country:ChinaCandidate:Z B LiFull Text:PDF
GTID:2248330395457370Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development of society and science, the amount of information data increases exponentially, especially genomic sequences data, its increasing rate is incredible, which poses a big pressure on the storage and transmission of data, so there are many compression techniques proposed to solve this problem. Searching subsequences on compressed sequences is very important in practical applications such as genomic sequence analysis and keyword searching, where their sequence data is always kept compressed in order to reduce the great cost of storage space, so that it brings so much difficulty on searching subsequences, and we face great challenges in this research area.In this paper, the problem we studied is based on a lossless compression technique proposed recently. For several similar sequences, it stores a reference sequence and encodes only the differences with respect to the original sequence using edit operations, and it has a good compression effect for large character sequences such as gene. This compression technique is not only new, but also very important, so how to online search subsequences on this kind of large compressed sequences is the problem we solve in this paper.For these large lossless compressed sequences mentioned above, according to whether the reference sequence is online procecessed or not, we consider two settings and present the processing and optimization techniques for searching subsequences respectively. When the reference sequence is online processed, we make use of Boyer-Moore algorithm and present a searching subsequences algorithm on single compressed sequence, due to there are many common segments in reference sequence for multiple compressed sequences, then we present an searching algorithm on multiple compressed sequences. When the reference sequence is offline processed, we build a q-gram based index on it, and present a basic searching algorithm first, and then present an optimized searching algorithm taking advantage of the filtering of edit operations. We have conducted experiments on a genomic dataset, and the results show that the techniques can search subsequences on these large lossless compressed sequences efficiently.
Keywords/Search Tags:compressed sequences, online, searching subsequences, index, filter
PDF Full Text Request
Related items