Font Size: a A A

Compressed Index And Query Optimization On Large-Scale Genomic Data

Posted on:2016-08-03Degree:MasterType:Thesis
Country:ChinaCandidate:D ZhaoFull Text:PDF
GTID:2428330542455400Subject:Computer technology
Abstract/Summary:PDF Full Text Request
With the development of next-generation genomic sequencing technology,genomic sequence data continues to grow at an exponential rate.It is estimated that each individual human genome contains about three billion letters and storing these information requires large-capacity hardware and costs expensive of time and space,where the cost is rather high.Therefore,how to efficiently store and query these genomic sequence data becomes an increasingly urgent problem in genomics and medical science.Some relative studies have shown the similarity between any two genomic sequence data is extremely high.Based on this theory,the program called "DNAzip" whose main idea is to select a genomic sequence as the base sequence and store the other sequences as the differences from the base sequence,has been proposed.With this technology,we can effectively solve the problem of storage of genomic sequence data.However,there are many applications in real life require querying on genomic sequence data.The traditional solutions,which answer queries through decompressing the compressed genomic sequence data,require powerful computing capabilities,huge time cost,as well as extra space overhead.Therefore,our goal in this thesis is to develop methods that can do query search directly on compressed sequences fastly and efficiently,and then effectively solve the above problems.In this thesis,we review the existing pattern matching technology,study the the existing methods which directly query on compressed genomic sequence,and optimize these existing methods.For the shortcomings that the head_tail inverted index used in Min_Verify and C_Verify methods can not directly access to specific modifications,we firstly proposed new index structures,and then constructed the basic exact matching algorithm based on the new index structure.Since there are many dumplicate verifications in exact matching algorithm,we propose two filtering technique for the purpose of reducing query time:first,filter the dumplicate verifications of head gram and tail gram;second,filter the extra verifications of sequerces which do not modify the query interval on the base sequence.Finally,we optimized exact matching algorithm and approximate matching algorithm using the two filtering technique above.We have done lots of testing research on real datasets,compared with the existing algorithms on index size and query performance,and analyzed the experimental results.Experimental results showed that the optimized querying algorithms with filtering technique proposed in this thesis are superior to the other algorithms at run time,filtering capabilities and index size,and the bigger the length of query string is,the better the performance of the algorithm and the filtering capacity becomes.
Keywords/Search Tags:Genomic Sequence Data, Inverted index, Compressed data, Pattern Matching, Query optimization
PDF Full Text Request
Related items