
Research And Implementation Of Full-text Retrieval Combining Word Matching And Context Interaction

Posted on: 2022-12-31 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Wu | Full Text: PDF
GTID: 2518306761959519 | Subject: Automation Technology
Abstract/Summary:
Information retrieval is a broad discipline that has attracted considerable attention in industry. In recent years, the rapid growth of the Internet and of information resources has created the problem of information overload, and people depend more and more on information retrieval. Technology companies in China and abroad have developed their own full-text search engines, such as Baidu and Google. These engines have lowered the cost of reaching useful information and are becoming essential tools for filtering and browsing information. The goal of a full-text search engine is to pick out what users want from massive amounts of information in a short time.

Full-text retrieval generally consists of two ranking steps: coarse initial ranking and re-ranking. A simple, high-recall ranking algorithm first filters relevant documents out of a large collection, and one or more re-ranking methods then improve retrieval accuracy. To improve accuracy, many studies have applied deep neural network models to the re-ranking task. Experiments show that these models achieve better re-ranking performance, especially pre-trained language models, which achieve the current best results on various ad-hoc retrieval benchmarks. However, the computational complexity of a pre-trained language model is quadratic in the length of the input sequence, so in ad-hoc ranking tasks it is usually used only to predict the relevance of paragraphs or individual sentences. Making pre-trained language models perform well on document-level data at limited computational cost is the key to full-text retrieval.

To improve retrieval accuracy without compromising retrieval efficiency, this paper combines the traditional word matching algorithm TF-IDF with the computational idea of the Vector Space Model and proposes an improved version of the contextualized late-interaction model ColBERT. Filters are introduced into ColBERT to extract the more discriminative terms from the query, and the interaction calculation is modified so that relevance matching between query terms and passages is strengthened on top of semantic matching. Passage retrieval experiments and analyses on three public datasets verify the effectiveness of the improved scheme.

To aggregate sequential signals between passages with full semantic understanding, this paper imitates the front-to-back way humans read and adds a Gated Recurrent Unit (GRU) as a feature aggregator on top of the passage retrieval model above. After the query interacts with each passage to produce an interaction feature representation, the aggregator combines the representations of all passages into an interaction feature representation of the whole document, from which the matching score between the query and the document is computed. The experimental results show that this method effectively aggregates the sequential signals between passages, which enables it to perform well in full-text retrieval.

To verify the practicality of the full-text retrieval model, this paper builds a high-accuracy full-text search engine on it by independently encoding queries and documents into two sets of contextual embeddings and indexing the documents offline.
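
The abstract does not give the exact form of the filter or of the modified interaction, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes that the filter keeps the `keep` query terms with the highest inverse document frequency, and that each kept term's ColBERT-style maximum similarity over passage tokens is weighted by that IDF before summing. The function names, the smoothed IDF formula, and the toy two-dimensional embeddings are all hypothetical and chosen for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def idf(term: str, doc_freq: Dict[str, int], n_docs: int) -> float:
    """Smoothed inverse document frequency (an assumed formula); always >= 1."""
    return math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filtered_maxsim(
    query_terms: List[str],
    query_vecs: List[Vector],
    passage_vecs: List[Vector],
    doc_freq: Dict[str, int],
    n_docs: int,
    keep: int,
) -> float:
    """Late interaction restricted to the `keep` highest-IDF query terms.

    Assumes `passage_vecs` is non-empty and `query_vecs[i]` is the
    contextual embedding of `query_terms[i]`.
    """
    weights = [idf(t, doc_freq, n_docs) for t in query_terms]
    # Hypothetical filter: indices of the most discriminative query terms.
    kept = sorted(range(len(query_terms)), key=lambda i: weights[i], reverse=True)[:keep]
    score = 0.0
    for i in kept:
        # MaxSim: best-matching passage token for this query term.
        best = max(cosine(query_vecs[i], p) for p in passage_vecs)
        # Assumed modification: weight the semantic match by term IDF.
        score += weights[i] * best
    return score


# Toy data: "colbert" is rare in the collection, "the" is very common.
query_terms = ["the", "colbert", "model"]
query_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_freq = {"the": 1000, "colbert": 3, "model": 200}
n_docs = 1000

passage_a = [[0.0, 1.0], [0.1, 0.9]]  # tokens close to the "colbert" embedding
passage_b = [[1.0, 0.0], [0.9, 0.1]]  # tokens close to the "the" embedding

score_a = filtered_maxsim(query_terms, query_vecs, passage_a, doc_freq, n_docs, keep=2)
score_b = filtered_maxsim(query_terms, query_vecs, passage_b, doc_freq, n_docs, keep=2)
print(score_a > score_b)
```

With `keep=2` the common term "the" is filtered out, so a passage that only matches "the" gains nothing from it, while a passage matching the rare term "colbert" is rewarded in proportion to that term's IDF. In a real system the embeddings would come from a pre-trained encoder rather than hand-written vectors.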
Keywords/Search Tags: Information Retrieval, Passage Retrieval, Full-text Retrieval, Search Engine, Pretrained Language Model