Research On Self-indexing Algorithms For Highly Repetitive Document Collections Based On FM-index

Posted on:2022-02-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y K Yang

Full Text:PDF

GTID:2518306563963499

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

Full-text self-indexing technology refers to a data structure built on a large collection of textual materials.This structure can achieve efficient pattern string counting and location query in the original document collection;meanwhile,self-indexing differs from traditional indexing in that the original text can be restored integrally without storing the original text.Nowadays,although the huge information brings great convenience,it also brings great challenges to information retrieval:firstly,the space occupied by indexes can be large because the amount of information is too large;secondly,there are too many duplicates in the information,and many indexes cannot take advantage of the duplication of documents,making the efficiency of indexing very low.Therefore,the study of full-text self-indexing is of great importance.In this paper,we analyze algorithms such as suffix array,BWT data transformation,run-length compression,FM-index self-indexing and some latest succinct data structures,and make the following work based on them:(1)Based on the BWT plain computation method and the SAIS indirect computation method,this paper proposes the BWT-IS algorithm based on induced ranking.The algorithm first divides the original text into different L/S-type suffixes,and then calculates the partial BWT results of L-type and S-type successively by induced sorting,and finally obtains the complete BWT results of the original text after combination.The algorithm does not need to calculate the intermediate result SA,and can directly calculate the BWT from the original text.Compared with the SAIS indirect calculation method,the BWT-IS calculation time is approximately the same,and it is more memory-saving in the calculation.(2)In this paper,the FM-index is improved and the SEFM-index index is proposed.After obtaining the T_BWT of the original text T,SEFM-index uses the run-length compression method to obtain T?bwt,and from this,the head character sequence S,head character bit arrays B,B?,head dictionaries C,CS,etc.and the new LF mapping is given by these structures'calculation formula.In the high repetition document collection,the run-length compressed run number r is generally much smaller than the document length n.SEFM-index can obtain the SA values of the left and right sides of the sample points by reverse search,thus compressing the overall size of the index to O(r),solving the problem that the existing index needs to add an extra set of O(n)space samples during the locating operation,and improving the overall efficiency of indexing.(3)In this paper,we summarize the repetition measure of document collection,and propose the definition and determination method of high repetition document collection.Meanwhile,the BWT plain computation,SAIS indirect computation and BWT-IS algorithms,FM-index,RLFM-index and SEFM-index indexes are implemented and two sets of comparison experiments are conducted on six high repetition document collections from different domains.The experimental results show that the speed of the BWT-IS algorithm proposed in this paper is better than the traditional BWT plain construction method.Compared with the SAIS indirect algorithm,BWT-IS can complete the calculation of BWT in nearly the same time while reducing memory consumption,and is more computationally efficient than other algorithms.In addition,the improved SEFM-index has a reduced time consumption for positioning and a smaller space occupation for indexing compared with the other two indexes,and the overall performance is improved.

Keywords/Search Tags:

Full-text self-indexing, high repetition, document retrieval, FM-index

PDF Full Text Request

Related items

1	Study Of Indexing Techniques For Encrypted Full-Text Retrieval System
2	Research On Some Key Techniques Of Document Database
3	Research And Implementation Of Distribute Massive Text Data Index And Retrieval System
4	Research On Optimization Of Indexing Algorithm In Full-text Retrieval
5	The Design And Implement Of Fast Indexing Files Structure And Full-text's Information Retrieval System
6	Multi-document Full-text Retrieval System Design And Implementation
7	Research And Implementation Of An Open High-Performance Platform Of Full-Text Retrieval
8	The Full-text Indexing Technology Index Merge Algorithm Research And Analysis
9	Research On Dynamic Indexing Technologies In Full-text Retrieval System
10	The Research Of Full-Text Retrieval And Its Relative Security Technology For Chinese