Algorithms For Top-k Document Retrieval

Posted on:2018-12-13

Degree:Master

Type:Thesis

Country:China

Candidate:D L Zhang

Full Text:PDF

GTID:2348330521950902

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

In this paper,we study the Top-k document retrieval problem,that is,for a given document set D={d₁,d₂...,d_n},build an index on D,and use the relevant scoring function to score each document.So that for any given pattern P,it can return the most relevant k documents in the document set.In the natural language text,the problem mainly uses the inverted index as the retrieval model.It uses"tf-idf"as the scoring function,and according to the score to retrieve the most matching multiple text of the query keyword.However,since the inverted index requires the processed text has good word segmentation characteristics,that is,there is a clear delimiter between words and words,so for unstructured streaming text,such as DNA sequences,music sequences,etc.,inverted index cannot apply.Taking the document as a sequence of consecutive symbols,and constructing a full-text index structure on the symbol sequence,this paper implements the Top-k search on arbitrary type of document,and also supports the retrieval of any substring by constructing the full-text index structure on the symbol sequence.The work of this paper is based on the classic Top-k document retrieval framework,successfully implement and improve the framework,as follows:First,this paper presents a new construction method for generalized suffix tree for Top-k document retrieval.You can build a generalized suffix tree by concatenating a document with a different delimiter than the existing need to concatenate each document with a different delimiter into a string constructor.In this way,our constructor not only applies to large character sets,but also requires less storage space.Secondly,this paper proposes a new N-strcuture initialization method in Top-k document retrieval.By initialing the N-structure of the generalized suffix tree from bottom to top,we can avoid the need of construct a separate suffix tree for each document in the original framework,and reduce the required space.Finally,in the document retrieval phase,this paper combines the concise ordered tr ee to improve the query efficiency.Through concurring the rank/select operation on a concise ordered tree,we can get the rightmost child leaf nodes of any internal node in the tree at O?1?time,thus quickly getting the corresponding search interval.The experimental results show that compared with the classical retrieval method,th e efficiency of our method is improved by 2-3 orders of magnitude,and the retri eval of the model can be completed in microsecond time.Our engineered code can be obtained at https://github.com/Hongweihuo-Lab/Top-k-document-retrieval.

Keywords/Search Tags:

document retrieval, top-k, suffix tree, generalized suffix tree, balanced parentheses sequence

PDF Full Text Request

Related items

1	Finding MUMs With Enhanced Suffix Arrays
2	An Algorithm Based On Suffix Tree For Identification Of Repeats In DNA Sequence
3	Research On Construction Of Index Structure For Biological Sequences
4	Research And Implementation Of Distributed And Parallel Algorithm For Large-Scale Generalized Suffix Tree Construction
5	Research On Web Document Clustering Approaches Based On Phrase Features
6	Multi-pattern Matching With Wildcards Based On Suffix Tree And Suffix Array
7	Research Of Finding Maximal Unique Matches In Genome
8	Research On Vietnamese News Topic Recognition Method Based On Suffix Tree Clustering Algorithm
9	Research On Temporal XML Index Based On Suffix Tree
10	Research Of A Suffix Tree Based Automatic Wrapper Generation Method