Font Size: a A A

Study On Algorithms For Compressed Full-text Self-indexes

Posted on:2015-05-02Degree:MasterType:Thesis
Country:ChinaCandidate:L G ChenFull Text:PDF
GTID:2308330464464658Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Text searching and processing has been a central issue in Computer Science research since its beginnings. The inverted list structure is an extremely popular index to handle so called ―natural language‖ texts, due its simplicity, low space requirements, and fast query execution. However, there are contexts cannot be factored out into sequences of words, typical examples include string matching, genome analysis, arbitrary pattern query and search engines for Far East languages. In these cases, we need full-text index, unfortunately, full-text index wasted a lot of space, that is why data structures like suffix tree and suffix array are heavy in practice, compressed full-text self-index is born to figure out this situation, whose size is proportional to that of the compressed text, a rapid sequence of achievements showed how to relate information theory with string matching concepts. It can answer pattern query very efficiently only use the compressed data with a negligible penalty in time and space performance. This paper is devoted to designing practical compressed full-text self-index structures, the main content of this dissertation is summarized as follows.The abuse using of running memory when computing suffix array makes the crucial procedure to act as a bottleneck in many applications. So, the first part focuses on solving the problem, we develop a lightweight construction DCV, which uses only 5~6 times running memory than original data. Experiments with several well known algorithms show that DCV is significant better than MM, DC3 and KA, comparative with LS.We also design and develop two practical compressed full-text self-index structures for the compressed suffix arrays, called CSA and Adaptive-CSA. We partition the gap sequence of Φ array into blocks and use gamma encoding, hybrid encoding for CSA and Adaptive-CSA, respectively. CSA and Adaptive-CSA retain the theoretical performance of previous work and introduces some improvements in practice. It can do count query in O(m log n) time, where m denotes the length of the pattern, n denotes the size of the text, and only need 2nHk(T) + n + o(n) bits in space, where Hk(T) denotes the k-th order entropy. Experiments on the Canterbury Corpus and the Pizza&Chili Corpus show significant advantages over other CSAs.Finally, we focus on designing an efficient FM-index, called Adaptive-FM, we use a versatile structure, i.e., wavelet tree(WT)[23], which is critical to the performance of space and speed. As a by-product, we develop an efficient BitMap structure supporting rank query, and acting as a significant component. Adaptive-FM partitions each BitMap of nodes in WT into blocks and applies hybrid encoding to achieve data-aware compression. Adaptive-FM retains the theoretical performance of previous work and introduces some improvements in practice. It can do count query in O(m log?) time, and needs 2nHk(T) + o(n)log? bits in space, where ? denotes the size of alphabet. Experiments on the Canterbury Corpus and the Pizza&Chili corpus show significant advantages, especially in space.Summing up, our works on SA, CSA and FM-index are excellent, Adaptive-CSA and Adaptive-FM both are data-aware and among the best ones.
Keywords/Search Tags:suffix array, compressed full-text self-index, CSA, FM-index, data-aware
PDF Full Text Request
Related items