The Compaction Of Full-text Indexing Structures And Its Applications

Posted on:2010-09-16

Degree:Master

Type:Thesis

Country:China

Candidate:C Liu

Full Text:PDF

GTID:2178360302965919

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

Index structures, such as Directed Acyclic Word Graph (DAWG), suffix trees, play an important role in many applications such as pattern matching, intrusion detection systems and computational biology. In this thesis, we review several classical index structures, including: suffix trie, DAWG, suffix tree, factor oracle. The result about the space and time complexity of these data structures are given in the thesis, along with the construction algorithm for these structures. We also study the relations between the index structures.Suffix trees are basic data structures for string algorithms. A pattern can be found in time proportional to the pattern length from a text by constructing the suffix tree of the text in advance. The suffix tree can also be used for more complicated problems, for example finding the longest repeated substring in linear time. Many efficient string algorithms are based on the use of suffix trees because this does not increase the asymptotic time complexity. A suffix tree of a string can be constructed in linear time in the string length.The DAWG, or Directed Acyclic Word Graph, is the smallest fnite state automaton that recognizes all the substrings appearing in the given text. Based on DAWG, the existence of a pattern P in T can be determined in O(p) time. By compacting the edges of the DAWG and augmenting additional information in the nodes, we obtain the labeled compact DAWG (CDAWG), which is equivalently obtained from the suffix tree of the text by merging its edge-isomorphic subtrees and deleting part of the resulting structure. CDAWG provides significant reductions of the memory space required by suffix trees and DAWGs; nevertheless, in order to support locating all the occurrences of a pattern in the text, the space requirement is still O(n log n) bits. The major drawback that limits the applicability of indexes is their space complexity. In this thesis, we study the Weighted Directed Word Graph (WDWG), a new index structure that is as efficient as the DAWG, but more space economical. WDWGs compact the DAWG in a new way that merges states to fewer state sets and the edges that emit from state set to fewer weighted edges. As a result, a family of weighted automata is gotten that can be viewed as an alternative to suffix trees and DAWGs. For a given word w, the WDWG of w has at most |w| + 1 states and 2|w| ? 1 transition edges. On-line algorithm to build WDWG from input word in liner time is implemented in this thesis. Several experiments are conducted in this these, the results show that WDWGs are space efficient than DAWG.Suffix array was proposed to reduce the space cost of suffix trees. It consists of the values of the leaves of the suffix tree in in-order, but without the tree structure information, hence takes only nlogn bits. Recent researches are focused on reducing the sizes of full-text indices. The compressed suffix array structure proposed by Grossi and Vitter is the first method that reduces the size of the suffix array from O(nlogn) bits to O(n) bits and supports access to any entry of the original suffix array in O(log|Σ|ε)time, for any fixed constant 0<ε<1 (without computing the entire original suffix array). FM-index proposed by Ferragina and Manzini is a self-index data structure with good compression ratio and fast decompressing speed. The FM-index occupies at most 5nHk (T) + o(n) bits of storage and allows the search for the occ occurrences of a pattern P[1..p] within T in O(p+occlog1+εn) time.He present a succinct representation of suffix arrays of binary strings that uses n + o(n) bits. For the case of large alphabet, they suggested an approach which conceptually sets a bit vector for each alphabet symbol to support operations $rank$ and $select$, and uses a wavelet tree in the actual implementation to save space. They also proved a categorization theorem by which one can determine whether a given permutation is the suffix array for a binary string.We study a data structure to support rank and select operations on an alphabetΣusing nlog|Σ|+o(nlog|Σ|) bits in O(log|Σ|) time for a text of length n. It also supports an extended rank, namely rank≤, such that rank≤(T, i) returns the number of letters which are smaller thanαin string T, plus the number ofαs up to position i. Also, it runs in O(log|Σ|) time. By this structure, we implement the DAWG succinctly. The main structure only takes nlog|Σ| + o(nlog|Σ|) bits and supports basic operations of DAWG efficiently.

Keywords/Search Tags:

full-text index structure, string matching algorithm, suffix automaton

PDF Full Text Request

Related items

1	An Algorithm For Matching Strings With Wildcards
2	Study On Automaton-Based Regular Expression Matching Algorithms
3	A Full-text Indexing Model Based On Suffix Array And Posting List
4	Study On Algorithms For Compressed Full-text Self-indexes
5	Study On The Algorithm For Full-Text Self-indexes Compression
6	The Research And Implementation Of Bit-parallel Based Dictionary Searching
7	The Research Of Distributed Index Technology Based On Self-indexed Compressed Full-text
8	The String Pattern Searching Algorithms Based On Suffix Arrays
9	Research On Construction Of Index Structure For Biological Sequences
10	Research On Full-Text Index Model Based On Full-Text Database