Research On Retrieval Method Based On Positional Relationship In Document

Posted on: 2021-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Wang
GTID: 2428330605961397
Subject: Computer technology
Abstract/Summary:
The rapid development of the Internet has brought explosive growth in information, and picking out the required information from this noisy mass is an urgent problem. Most existing information retrieval models evaluate documents and select candidate expansion terms mainly by term frequency in the document, inverse document frequency, and document length, and they may ignore where the terms occur in the document. Recent studies have shown that using the positional relationship of terms is an effective way to improve retrieval performance. Although these models have achieved positive results, there is still room for improvement in how they capture the position information of terms and how they measure positional influence. This thesis therefore studies information retrieval methods based on the positional relationship within documents. The main work covers the following three aspects.

First, a probability-based retrieval method named BM25-LR is proposed to capture the positional relationships within a document. The observation behind this work is that in most articles the author summarizes their viewpoint at a specific position in the document, such as the beginning or the end, so terms at these positions represent the topic better. To model the different positions of terms in the document, we use a kernel method that assigns higher position weights to terms at the beginning and end of the document. This position feature is then applied to the classic BM25 probability model, and the weights of the query terms are optimized to retrieve documents that are more relevant to the query. On five TREC data sets, we compared the BM25-LR retrieval method with the traditional BM25 model under the MAP and P@20 measures. The results show that the proposed method significantly increases MAP on all data sets and increases P@20 on most data sets.

Second, the position feature is applied to pseudo-relevance feedback, and a pseudo-relevance feedback method named LRoc is proposed to capture the positional relationship within a document. This method models the different positions of candidate expansion terms in the document and assigns higher position weights to candidate terms at the beginning and end of the document. The position information of the candidate terms is then introduced into the traditional Rocchio model. When selecting and evaluating candidate terms, the proposed method considers not only term frequency but also the position of the terms, and so obtains expansion terms that are more relevant to the original query. On five TREC data sets, we compared the LRoc method with the traditional Rocchio model under the MAP and P@20 measures. The experimental results show that the MAP and P@20 values of the proposed method are significantly improved on all data sets.

Third, this thesis designs and implements an information retrieval system based on the positional relationship within documents. The system follows the classic MVC design pattern and consists of six functional modules. The user enters a query to perform a search, and the system returns the retrieval results and displays the query expansion terms that were added. The feasibility and validity of the proposed models can be checked intuitively by reading the top-ranked documents and the expansion terms.
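To make the idea of position weighting concrete, the sketch below shows one way a kernel could favour terms near the beginning and end of a document, and how such a weight might feed into a BM25-style score and a Rocchio-style selection of expansion terms. The abstract does not specify the kernel, the parameters, or how BM25-LR and LRoc combine position with frequency, so everything here is an assumption made for illustration: the Gaussian kernel, the parameters `sigma` and `lam`, the choice to replace the raw term frequency with a position-weighted one, and all function names are hypothetical and are not taken from the thesis.

```python
import math
from collections import defaultdict


def position_weight(index, doc_len, sigma=0.15):
    """Assumed Gaussian kernel: close to 1 at either end of the document,
    close to 0 in the middle. sigma is a hypothetical width parameter."""
    if doc_len <= 1:
        return 1.0
    p = index / (doc_len - 1)  # normalized position in [0, 1]
    head = math.exp(-(p ** 2) / (2 * sigma ** 2))
    tail = math.exp(-((1 - p) ** 2) / (2 * sigma ** 2))
    return max(head, tail)


def weighted_tf(tokens, lam=1.0, sigma=0.15):
    """Position-weighted term frequency (an assumption): each occurrence
    counts 1 + lam * position_weight instead of 1."""
    wtf = defaultdict(float)
    n = len(tokens)
    for i, term in enumerate(tokens):
        wtf[term] += 1.0 + lam * position_weight(i, n, sigma)
    return wtf


def idf(term, docs):
    """BM25 inverse document frequency, with +1 inside the log to keep it positive."""
    n = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)


def bm25_position_score(query, doc, docs, k1=1.2, b=0.75, lam=1.0):
    """Standard BM25 formula with the raw tf swapped for weighted_tf."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    wtf = weighted_tf(doc, lam)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    return sum(
        idf(q, docs) * wtf[q] * (k1 + 1) / (wtf[q] + norm)
        for q in query
        if q in wtf
    )


def expansion_terms(query, feedback_docs, docs, n_terms=2, lam=1.0):
    """Simplified Rocchio-style selection: average the position-weighted
    tf * idf of each term over the feedback documents and return the
    highest-scoring terms that are not already in the query."""
    scores = defaultdict(float)
    for d in feedback_docs:
        for term, w in weighted_tf(d, lam).items():
            scores[term] += w * idf(term, docs) / len(feedback_docs)
    candidates = [(t, s) for t, s in scores.items() if t not in query]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [t for t, _ in candidates[:n_terms]]
```

With `lam=0` the position term vanishes and `bm25_position_score` reduces to plain BM25, which makes it easy to compare the two rankings on the same collection. In this sketch a document that mentions a query term in its first or last sentences scores higher than a document of the same length that mentions it only in the middle.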
Keywords/Search Tags: Information retrieval, Location influence, Probability-based retrieval model, Pseudo-relevance feedback, Query expansion