Font Size: a A A

Study On The Algorithm For Full-Text Self-indexes Compression

Posted on:2019-07-14Degree:MasterType:Thesis
Country:ChinaCandidate:W Y GuoFull Text:PDF
GTID:2348330542487554Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the continuous development of Internet technology,network information has increased explosively.Complex text data brings convenience to people,but also brings huge challenges to text retrieval.Inverted index technology can solve some of the needs,but it causes problems in the accuracy of the retrieval when word segmentation is inaccurate or can't be carried out.The full-text self-indexes algorithm segments the text at the granularity of single symbol instead of word,which can solve the problem of exact matching.The space occupied by the full-text self-indexes is 4-20 times that of the original text,resulting in a great waste of space.Therefore,the study on the algorithm for full-text self-indexes compression is of great significance for full-text retrieval.In this paper,we study the knowledge of suffix array,rank/select/access operation,BWT data rotation algorithm,wavelet tree and integer coding compression algorithms.On this basis,we design efficient algorithms for full-text self-indexes compression.The main work is as follows:?1?Based on the Sad-CSA algorithm,this paper proposes PEF-CSA self-indexes compression algorithm of exploiting the concept of context partition and preserving a layer of context structure.The algorithm uses the Partitioned-Elias-Fano coding compression algorithm to compress the intermittent monotone increasing neighbour array ? transformed by the suffix array,and achieves good compression effect and query performance with the two-level compression structure.?2?Based on the original FM-Index algorithm,this paper proposes Adaptive-FM-Index self-indexes compression algorithm.It turns the original text T into Tbwt after BWT data rotation,and uses the Huffman wavelet tree structure to store Tbwt to get HWT(Tbwt).The bit sequence stored in each node of HWT(Tbwt)is divided into the two-level structure of the super-block and the block to improve the query speed.According to the data distribution characteristics of blocks,the adaptive coding method is selected to greatly improve the compression performance.We combine the sampling suffix array and sampling rank array as auxiliary structure to gain an efficient self-indexes structure.?3?In this paper,we implement the PEF-CSA self-indexes compression algorithm,the Sad-CSA compression algorithm,the RL-CSA compression algorithm and the SDSL-CSA algorithm.Experimental results suggest that PEF-CSA self-indexes compression algorithm performs best amongst the CSA algorithms in the compression ratio and count query performance.The locate query performance of the proposed algorithm is better than most CSA algorithms.We implement the adaptive-FM-Index self-indexes compression algorithm,the FM-RRR algorithm,FM-uncompressed algorithm,FM-hybrid algorithm and RLFM algorithm.The results illustrate that Adaptive-FM-Index self-indexes compression algorithm has better compression ratio,count query performance and locate query performance than other FM-Index algorithms,and better compression in dataset with unbalanced character frequency.The compression ratio of Adaptive-FM-Index self-indexes compression algorithm is better than PEF-CSA algorithm.But for the balanced dataset like english dataset,the compression ratio of the PEF-CSA algorithm is lower than Adaptive-FM-Index algorithm.The locate query performance of PEF-CSA self-indexes compression algorithm is better than Adaptive-FM-Index self-indexes compression algorithm.
Keywords/Search Tags:Suffix Array, Full-Text Self-Indexes, Index Compression, PEF-CSA, Adaptive-FM-Index
PDF Full Text Request
Related items