Font Size: a A A

Research On Compressed Index Based On Wavelet Tree And Its Extended Application

Posted on:2022-10-24Degree:MasterType:Thesis
Country:ChinaCandidate:P F LiuFull Text:PDF
GTID:2518306605467754Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Big data is growing at an exponential rate,and a large amount of data is text data.As text data is the main medium of information transmission,the cost of storage,transmission and retrieval of text information has risen sharply.Therefore,it is necessary to compress the text to save storage and transmission costs,and to establish a compressed index that supports efficient retrieval.Compressed self-indexes are used widely in many string processing applications,such as information retrieval,genome analysis,data mining,and web searching.The index not only indexes the data,but also encodes the data,and it is in compressed form.Moreover,the index and the data it encodes can be operated upon directly,without need to uncompress the entire index,so as to save time while maintaining small storage space.In some applications,such as pan-genomics analysis,because the pan-genomics is associated with a graph-like structure,it is impossible to make full use of the compressed self-indexing for sequences,so it is necessary to find a method to represent the pan-genomics.This thesis establishes a high-order entropy-compressed text self-indexing framework based on the wavelet tree,proposes a hybrid coding method to represent the wavelet tree nodes,and gives a theoretical analysis on the space usage of the proposed text self-indexing.Then,facing the potential application of pan-genomics data in read alignment problems,the compressed self-index on the sequence is extended to the compressed self-index on the graph data.The main innovations of this thesis are as follows:A high-order entropy-compressed text self-indexing,call WFM,based upon wavelet tree is proposed,and a method for fast retrieval in compressed space is established.WFM uses the Burrows-Wheeler transform(BWT)and the wavelet tree,combined with hybrid encoding and succinct data structures,to achieve minimal space usage and fast retrieval on the compressed indexes.For the text T=[0,n-1]consists of n symbols from an alphabet(50)of size?,WFM uses 2 n H_k+o(n log?)bits of space,for some constant 0<c<1,where H_kdenotes the kth-order empirical entropy of T,and k clog_?n-1.Given any pattern P of m symbols encoded in m log?bits,the Count query of WFM to return the suffix range of pattern P runs in O(m log?)time;the Locate query of WFM to locate all the occ occurrences of P in T runs in O((m+occ log n loglog n)log?)time.the Extract query of WFM to extract a text substring T[st,st+len-1]runs in O((len+log n loglog n)log?)time,given.A graph compressed index for pan-genomics is proposed.For a given reference genome and corresponding known mutation information,which simple constitute sets of aligned sequences(defined as pan-genomics).These sequences can be linked in a graph-like structure.Therefore,the corresponding directed acyclic label graph can be constructed,Then each node in the graph is sorted according to the common prefix of all paths(lexicographically adjacent suffixes)of the graph,and the BWT transformation and auxiliary flag bit sequences of the corresponding graph are constructed.The compressed wavelet tree is used to represent the BWT string,and node information on graph data is sampled and stored to support path location query.Experiments on the benchmark text data set show that for text data of different sizes of alphabets,compared with mainstream indexing methods,WFM has some advantages in compression rate and query time,especially when the block size b=128,the advantage of query time is more significant.In fact,we can weigh space occupation and query time by setting different parameter values.When b increases,the space occupied by WFM will decrease and the query time will increase;when b decreases,the space occupied by WFM will increase and the query time will decrease.Similarly,the difference in the sampling size of the suffix array s and the change in the block size bring about the same space and query trends.Experiments on the benchmark pan-genomics data show that compared with the state-of-the-art method GCSA for representing pan-genomics,GFM has advantages in space occupation and positioning query.This laid the foundation for the further application of GFM to large-scale read alignment to pan-genomics.
Keywords/Search Tags:Wavelet tree, Compressed index, Hybrid encoding, Graph indexing, pan-genomics
PDF Full Text Request
Related items