Font Size: a A A

Algorithms For Top-k Document Retrieval

Posted on:2018-12-13Degree:MasterType:Thesis
Country:ChinaCandidate:D L ZhangFull Text:PDF
GTID:2348330521950902Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In this paper,we study the Top-k document retrieval problem,that is,for a given document set D={d1,d2...,dn},build an index on D,and use the relevant scoring function to score each document.So that for any given pattern P,it can return the most relevant k documents in the document set.In the natural language text,the problem mainly uses the inverted index as the retrieval model.It uses"tf-idf"as the scoring function,and according to the score to retrieve the most matching multiple text of the query keyword.However,since the inverted index requires the processed text has good word segmentation characteristics,that is,there is a clear delimiter between words and words,so for unstructured streaming text,such as DNA sequences,music sequences,etc.,inverted index cannot apply.Taking the document as a sequence of consecutive symbols,and constructing a full-text index structure on the symbol sequence,this paper implements the Top-k search on arbitrary type of document,and also supports the retrieval of any substring by constructing the full-text index structure on the symbol sequence.The work of this paper is based on the classic Top-k document retrieval framework,successfully implement and improve the framework,as follows:First,this paper presents a new construction method for generalized suffix tree for Top-k document retrieval.You can build a generalized suffix tree by concatenating a document with a different delimiter than the existing need to concatenate each document with a different delimiter into a string constructor.In this way,our constructor not only applies to large character sets,but also requires less storage space.Secondly,this paper proposes a new N-strcuture initialization method in Top-k document retrieval.By initialing the N-structure of the generalized suffix tree from bottom to top,we can avoid the need of construct a separate suffix tree for each document in the original framework,and reduce the required space.Finally,in the document retrieval phase,this paper combines the concise ordered tr ee to improve the query efficiency.Through concurring the rank/select operation on a concise ordered tree,we can get the rightmost child leaf nodes of any internal node in the tree at O?1?time,thus quickly getting the corresponding search interval.The experimental results show that compared with the classical retrieval method,th e efficiency of our method is improved by 2-3 orders of magnitude,and the retri eval of the model can be completed in microsecond time.Our engineered code can be obtained at https://github.com/Hongweihuo-Lab/Top-k-document-retrieval.
Keywords/Search Tags:document retrieval, top-k, suffix tree, generalized suffix tree, balanced parentheses sequence
PDF Full Text Request
Related items