
BERT-based Two-stage Long Document Retrieval Model Fused With Supplementary Information

Posted on: 2022-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Liang
Full Text: PDF
GTID: 2518306554982639
Subject: Computer technology
Abstract/Summary:
With the rapid development of the social economy and the advancement of science and technology, the data generated daily in production and everyday life has grown beyond the scale that existing tools can handle. How to extract valuable information from this massive amount of data, and how to return better-matching answers to users' queries, have become pressing problems in information retrieval. Although the same query-document pair can be described in many ways, finding a better-matching answer ultimately depends on understanding the meaning of the query and the document; from the perspective of deep learning, this means representing both in a more meaningful vector form.

Prior work has shown that word embeddings provide richer information than traditional bag-of-words models. However, static word vector models such as word2vec and fastText cannot resolve the inherent ambiguity of words, and models based on convolutional and recurrent neural networks cannot effectively model long-range context. Since 2018, Transformer-based pre-trained models, represented by BERT, have performed language modeling through a variety of pre-training tasks on large-scale corpora; the resulting language models effectively address the problems above and open a new door for many tasks in information retrieval and natural language processing. Nevertheless, because of the input-length limit of pre-trained models, such models remain comparatively weak at processing long documents.

To apply BERT more effectively to long-document retrieval under the constraints of computing power and inference time, we propose a method for constructing relative-importance labels for long documents, together with a two-stage retrieval model composed of the important-paragraph recognition model IPRM and the document reranking model LtBERT. First, a selected importance measure is used to label the paragraphs into which each long document is divided, producing a dataset of appropriate length. The IPRM model trained on this dataset can then focus on the relatively more important paragraphs of an unseen long document. In addition, by modifying the input form of IPRM, the model can better perceive the positional association between headings and paragraph terms. Finally, LtBERT makes better use of word-vector information than plain BERT, which improves its performance on document reranking to a certain extent. We conducted experiments on the Robust04 and ClueWeb09 datasets; the results on the MAP, nDCG@20, and P@20 evaluation metrics demonstrate the soundness of the overall approach and the effectiveness of the model.
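To make the two-stage pipeline concrete, the sketch below illustrates one possible reading of it in Python: paragraphs are scored for importance against the query, the top-scoring ones are kept and prefixed with the document title, and a BERT classifier reranks the shortened document. A simple term-overlap heuristic stands in for the trained IPRM, a generic bert-base-uncased relevance head stands in for LtBERT, and the paragraph splitter and top_k value are illustrative assumptions; none of this is the thesis's actual implementation.

```python
# A minimal sketch of the two-stage idea, under the assumptions stated above.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # single-score relevance head (assumed)

def split_into_paragraphs(document: str) -> list[str]:
    """Divide the long document into paragraph-sized units."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def importance_scores(query: str, paragraphs: list[str]) -> list[float]:
    """Stage 1 stand-in for IPRM: score each paragraph against the query
    with a simple query-term-overlap heuristic."""
    q_terms = set(query.lower().split())
    return [len(q_terms & set(p.lower().split())) / (len(q_terms) or 1)
            for p in paragraphs]

def rerank(query: str, title: str, document: str, top_k: int = 3) -> float:
    """Stage 2: rerank using only the top-k important paragraphs, prefixed
    with the title so heading information is kept alongside paragraph terms."""
    paragraphs = split_into_paragraphs(document)
    ranked = sorted(zip(importance_scores(query, paragraphs), paragraphs),
                    reverse=True)
    shortened = title + " " + " ".join(p for _, p in ranked[:top_k])
    inputs = tokenizer(query, shortened, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```

Selecting paragraphs before reranking keeps the reranker's input within BERT's 512-token limit, which is exactly the constraint that motivates the two-stage design.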
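For reference, the reported metrics can be computed as follows. This is a generic sketch of P@20, average precision (MAP is its mean over queries), and nDCG@20 over a ranked list of relevance labels, not the evaluation code used in the thesis.

```python
# Generic implementations of the reported metrics over a ranked list of
# relevance labels (1 = relevant, 0 = not relevant).
import math

def precision_at_k(rels: list[int], k: int = 20) -> float:
    """P@k: fraction of the top-k ranked documents that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels: list[int], num_relevant: int = 0) -> float:
    """AP for one query; MAP is the mean of AP over all queries.
    num_relevant is the total number of judged-relevant documents; if
    omitted, it defaults to the relevant documents seen in the ranking."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    denom = num_relevant or sum(rels)
    return total / denom if denom else 0.0

def ndcg_at_k(rels: list[int], k: int = 20) -> float:
    """nDCG@k with the standard log2 rank discount."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```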
Keywords/Search Tags: Information Retrieval, Natural Language Processing, Document Reranking, Pre-training Model