Font Size: a A A

Research On Self-indexing Algorithms For Highly Repetitive Document Collections Based On FM-index

Posted on:2022-02-05Degree:MasterType:Thesis
Country:ChinaCandidate:Y K YangFull Text:PDF
GTID:2518306563963499Subject:Computer technology
Abstract/Summary:PDF Full Text Request
Full-text self-indexing technology refers to a data structure built on a large collection of textual materials.This structure can achieve efficient pattern string counting and location query in the original document collection;meanwhile,self-indexing differs from traditional indexing in that the original text can be restored integrally without storing the original text.Nowadays,although the huge information brings great convenience,it also brings great challenges to information retrieval:firstly,the space occupied by indexes can be large because the amount of information is too large;secondly,there are too many duplicates in the information,and many indexes cannot take advantage of the duplication of documents,making the efficiency of indexing very low.Therefore,the study of full-text self-indexing is of great importance.In this paper,we analyze algorithms such as suffix array,BWT data transformation,run-length compression,FM-index self-indexing and some latest succinct data structures,and make the following work based on them:(1)Based on the BWT plain computation method and the SAIS indirect computation method,this paper proposes the BWT-IS algorithm based on induced ranking.The algorithm first divides the original text into different L/S-type suffixes,and then calculates the partial BWT results of L-type and S-type successively by induced sorting,and finally obtains the complete BWT results of the original text after combination.The algorithm does not need to calculate the intermediate result SA,and can directly calculate the BWT from the original text.Compared with the SAIS indirect calculation method,the BWT-IS calculation time is approximately the same,and it is more memory-saving in the calculation.(2)In this paper,the FM-index is improved and the SEFM-index index is proposed.After obtaining the TBWT of the original text T,SEFM-index uses the run-length compression method to obtain T?bwt,and from this,the head character sequence S,head character bit arrays B,B?,head dictionaries C,CS,etc.and the new LF mapping is given by these structures'calculation formula.In the high repetition document collection,the run-length compressed run number r is generally much smaller than the document length n.SEFM-index can obtain the SA values of the left and right sides of the sample points by reverse search,thus compressing the overall size of the index to O(r),solving the problem that the existing index needs to add an extra set of O(n)space samples during the locating operation,and improving the overall efficiency of indexing.(3)In this paper,we summarize the repetition measure of document collection,and propose the definition and determination method of high repetition document collection.Meanwhile,the BWT plain computation,SAIS indirect computation and BWT-IS algorithms,FM-index,RLFM-index and SEFM-index indexes are implemented and two sets of comparison experiments are conducted on six high repetition document collections from different domains.The experimental results show that the speed of the BWT-IS algorithm proposed in this paper is better than the traditional BWT plain construction method.Compared with the SAIS indirect algorithm,BWT-IS can complete the calculation of BWT in nearly the same time while reducing memory consumption,and is more computationally efficient than other algorithms.In addition,the improved SEFM-index has a reduced time consumption for positioning and a smaller space occupation for indexing compared with the other two indexes,and the overall performance is improved.
Keywords/Search Tags:Full-text self-indexing, high repetition, document retrieval, FM-index
PDF Full Text Request
Related items