
Research On Document Retrieval Based On Index Optimization And Text Snippet Mechanism

Posted on: 2021-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2428330647456989
Subject: Computer Science and Technology Computing Application Technology
Abstract/Summary:
Information retrieval is a research hotspot in natural language processing, and document retrieval over long texts is widely used in many fields. Compared with short texts, long texts are lengthy and carry diverse information, and their narrative content may touch on many themes. These characteristics lead to two common problems in document retrieval research. First, two kinds of documents directly affect index quality: the large number of irrelevant documents brought in by inefficient index words in the index table, and relevant documents that are connected through semantic relationships between words and are often ignored. Inefficient index words are words that appear in many documents but are not representative of most of them. As a result, many of the documents whose IDs are mapped to these words in the index table are unrelated to a specific query statement. Second, a query statement is not always related to the sentences of a long text, so some highly similar segments interfere strongly with retrieval. The index mechanism and correlation analysis therefore deserve further research in long text retrieval.

This paper discusses the problems that can arise in the long text index and the content scoring mechanism. It proposes measures that speed up the fetching of candidate documents by optimizing the index structure, and then screens the final results from the candidate documents using the Text Snippet Mechanism (TSM) to improve the accuracy of the search results. The main research contents of this paper are as follows:

(1) To address inefficient index words and the lack of semantic association in the long text database, this paper designs an index construction scheme that combines TextRank-based reduction with association rule expansion. First, TextRank selects a word set as the index word set according to the graph structure information of the document. Then, for the selected words, association rule mining is used to find semantically associated words in the corpus (i.e. the document set) and expand the index word set. This strategy weakens the influence of inefficient index words while improving the ability to supplement semantically associated phrases.

(2) To address the query efficiency problems caused by matching scale and resource calls in long text retrieval, this paper proposes two document retrieval methods based on a block index structure, DLET and Ratio-block, both of which adopt the idea of early termination. The block index is stored as a fixed-length block list. Each record consists of the ID of a document containing the index word and the correlation score between the index word and that document. The records of a block list are sorted from high to low by correlation score, and the upper limit value (the maximum correlation score of the block list) is recorded in the index. During retrieval, the block lists are ranked by their upper limit values, and the two methods then rank the candidate documents in different ways. DLET dynamically updates the scores of the first few documents and stops searching once the results are no longer replaced. Ratio-block truncates the ranked block lists at a certain proportion, and only the retained block lists take part in the subsequent correlation calculation and sorting. The experimental results show that both DLET and Ratio-block speed up the retrieval of candidate documents.

(3) To address the low frequency and uniform distribution of related words caused by very long texts in document retrieval, this paper puts forward a retrieval model called the Text Snippet Mechanism (TSM), whose goal is to obtain the Top-k document set. TSM first divides each candidate document into snippets according to segmentation rules. Next, it calculates the correlation between the query statement and each document snippet. It then filters out the key text snippets and obtains the ratio of related snippets. Finally, the information from these key snippets is combined to calculate the correlation score between the query and the candidate document. The experimental results show that TSM improves the accuracy of the retrieval model.
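
The four TSM steps in (3) can be illustrated with a minimal Python sketch. The abstract does not give the segmentation rules, the correlation measure, or the final scoring formula, so everything specific here is an assumption made for illustration: snippets are fixed groups of sentences, correlation is cosine similarity over term counts, and the hypothetical parameters `threshold` and `alpha` control key-snippet filtering and the weighting between the best key snippet and the related-snippet ratio. It is not the thesis implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_snippets(document, sentences_per_snippet=2):
    """Assumed segmentation rule: group consecutive sentences into snippets."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_snippet])
        for i in range(0, len(sentences), sentences_per_snippet)
    ]


def cosine(query_tokens, snippet_tokens):
    """Cosine similarity of term-count vectors (assumed correlation measure)."""
    q, s = Counter(query_tokens), Counter(snippet_tokens)
    dot = sum(q[t] * s[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in s.values()))
    return dot / norm if norm else 0.0


def tsm_score(query, document, threshold=0.2, alpha=0.5):
    """Score one candidate document from its key snippets.

    threshold and alpha are hypothetical parameters, not values from the thesis.
    """
    q_tokens = tokenize(query)
    snippets = split_snippets(document)
    if not snippets:
        return 0.0
    # Step 2: correlation between the query and every snippet.
    scores = [cosine(q_tokens, tokenize(s)) for s in snippets]
    # Step 3: keep the key snippets and compute the related-snippet ratio.
    key_scores = [sc for sc in scores if sc >= threshold]
    if not key_scores:
        return 0.0
    ratio = len(key_scores) / len(snippets)
    # Step 4: combine key-snippet information into one document score (assumed formula).
    return alpha * max(key_scores) + (1 - alpha) * ratio


def top_k(query, candidates, k=2):
    """Return the IDs of the k highest-scoring candidate documents."""
    ranked = sorted(candidates, key=lambda doc_id: tsm_score(query, candidates[doc_id]), reverse=True)
    return ranked[:k]
```

Calling `top_k(query, candidates, k)` with a dictionary that maps document IDs to text returns the best-scoring IDs. Scoring by snippet rather than by whole document means that a long document with one strongly matching passage is not diluted by its unrelated remainder, which is the motivation the abstract gives for TSM.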
Keywords/Search Tags: Document retrieval, Text index optimization, Block-index strategy, Text Snippet Mechanism, Text relevance calculation