Font Size: a A A

Study On Algorithms For Full-text Indexes In External Memory

Posted on:2016-06-01Degree:MasterType:Thesis
Country:ChinaCandidate:Y H ZhaoFull Text:PDF
GTID:2348330488474186Subject:Engineering
Abstract/Summary:PDF Full Text Request
Indexing massive data and providing retrieval ability on it is the basic requirement of people, especially in the Internet age. When retrieving the required data logging in the database and searching the web which meet your requirement, we need to index the data. For legal documents and bioinformatics data, we even need to provide full-text retrieval ability. Index data structure consists of internal memory and external memory index. As for those classic index structures, such as the suffix tree and suffix array, they have a fast retrieval speed but they take up too much space. Compressed full-text index such as CSA and FM-Index are not suitable for external memory.This paper puts forward m KD-GBWT which is a kind of time and space efficient external memory compressed text index. Theoretically m KD-GBWT takes O?n log s? bits space, and can support pattern matching in O(|P|/B +?n/B?1/2+ occ / B)I/Os,in which |P| is the length of pattern, n is size of text, s is the number of different characters of T and B is the size of a disk page.First, this paper proposes efficient external memory string B-tree algorithm, external memory structures of kd-tree and the orthogonal range searching algorithm. Then we detail principle of multiple kd-trees orthogonal range searching data structure and the pattern matching algorithm of it. We analyze the performance of our m KD-GBWT. At last we implement m KD-GBWT index and test for a wide variety of data such as uniform distribution data and highly repetitive data. Experiments show that, compared with the latest external memory compressed index GBWT in 2015, the algorithm we proposed not only is efficient in space but also has a good I/O performance.In aspect of engineering, in order to reduce the index size, we use crit-bit tree as the data structure of string B-tree's nodes and we store the string B-tree in the order from left to right and bottom to top by using the string B-tree's characteristic of storing the static data, so that we avoid to store the child pointers of every string B-tree's nodes explicitly. Meanwhile, we improve the external memory serialization algorithm of kd-tree and raise the space utilization of the disks which store kd-tree. In order to improve the query efficiency, we improve the suffix range query algorithm of string B-tree, and avoid the repetitive visit of the nodes which in the overlay path in the searching process. Due to the characteristic that the orthogonal range which need to be queried in the pattern matching can only exist in one divided subinterval, we promote to use multiple kd-tree as the orthogonal range searching structure of m KD-GBWT, each kd-tree only stores part of the point set, so that we can only need to use one kd-tree during the orthogonal range searching. Finally we improve the query efficiency. Furthermore, we improve the implementation algorithm of string B-tree by using the longest common prefix to reduce the building time, and improve the points splitting algorithm of kd-tree's internal nodes to make the kd-tree become more suitable for the disk serialized operation.
Keywords/Search Tags:external memory algorithm, full-text index, String B-tree, kd-tree, GBWT, mKD-GBWT, crit-bit tree
PDF Full Text Request
Related items