
BERT-based Two-stage Long Document Retrieval Model Fused With Supplementary Information

Posted on: 2022-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Liang
Full Text: PDF
GTID: 2518306554982639
Subject: Computer technology
Abstract/Summary:
With the rapid development of the social economy and the advancement of science and technology, the data generated daily in production and everyday life has grown beyond the scale that existing tools can handle. How to extract valuable information from this massive amount of data, and how to return better-matching answers to users' queries, have become pressing problems in information retrieval. Although the same query-document pair can be described in many ways, finding a better-matching answer ultimately depends on understanding the meaning of the query and the document; from the perspective of deep learning, this means representing both in a more meaningful vector form.

Prior work has shown that word embeddings provide richer information than traditional bag-of-words models. However, static word vector models such as word2vec and fastText cannot resolve the inherent ambiguity of words, and models based on convolutional and recurrent neural networks cannot effectively model long-range context. Since 2018, Transformer-based pre-trained models, represented by BERT, have performed language modeling through a variety of pre-training tasks on large-scale corpora; the resulting language models effectively address the problems above and open a new door for many tasks in information retrieval and natural language processing. Nevertheless, because of the input-length limit of pre-trained models, such models remain comparatively weak at processing long documents.

To apply BERT more effectively to long-document retrieval under the constraints of computing power and inference time, we propose a method for constructing relative-importance labels for long documents, together with a two-stage retrieval model composed of the important-paragraph recognition model IPRM and the document reranking model LtBERT. First, a selected importance measure is used to label the paragraphs into which each long document is divided, producing a dataset of appropriate length. The IPRM model trained on this dataset can then focus on the relatively more important paragraphs of an unseen long document. In addition, by modifying the input form of IPRM, the model can better perceive the positional association between headings and paragraph terms. Finally, LtBERT makes better use of word-vector information than plain BERT, which improves its performance on document reranking to a certain extent. We conducted experiments on the Robust04 and ClueWeb09 datasets; the results on the MAP, nDCG@20, and P@20 evaluation metrics demonstrate the soundness of the overall approach and the effectiveness of the model.
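To make the two-stage pipeline concrete, the sketch below illustrates one possible reading of it in Python: paragraphs are scored for importance against the query, the top-scoring ones are kept and prefixed with the document title, and a BERT classifier reranks the shortened document. A simple term-overlap heuristic stands in for the trained IPRM, a generic bert-base-uncased relevance head stands in for LtBERT, and the paragraph splitter and top_k value are illustrative assumptions; none of this is the thesis's actual implementation.

```python
# A minimal sketch of the two-stage idea, under the assumptions stated above.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # single-score relevance head (assumed)

def split_into_paragraphs(document: str) -> list[str]:
    """Divide the long document into paragraph-sized units."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def importance_scores(query: str, paragraphs: list[str]) -> list[float]:
    """Stage 1 stand-in for IPRM: score each paragraph against the query
    with a simple query-term-overlap heuristic."""
    q_terms = set(query.lower().split())
    return [len(q_terms & set(p.lower().split())) / (len(q_terms) or 1)
            for p in paragraphs]

def rerank(query: str, title: str, document: str, top_k: int = 3) -> float:
    """Stage 2: rerank using only the top-k important paragraphs, prefixed
    with the title so heading information is kept alongside paragraph terms."""
    paragraphs = split_into_paragraphs(document)
    ranked = sorted(zip(importance_scores(query, paragraphs), paragraphs),
                    reverse=True)
    shortened = title + " " + " ".join(p for _, p in ranked[:top_k])
    inputs = tokenizer(query, shortened, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```

Selecting paragraphs before reranking keeps the reranker's input within BERT's 512-token limit, which is exactly the constraint that motivates the two-stage design.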
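For reference, the reported metrics can be computed as follows. This is a generic sketch of P@20, average precision (MAP is its mean over queries), and nDCG@20 over a ranked list of relevance labels, not the evaluation code used in the thesis.

```python
# Generic implementations of the reported metrics over a ranked list of
# relevance labels (1 = relevant, 0 = not relevant).
import math

def precision_at_k(rels: list[int], k: int = 20) -> float:
    """P@k: fraction of the top-k ranked documents that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels: list[int], num_relevant: int = 0) -> float:
    """AP for one query; MAP is the mean of AP over all queries.
    num_relevant is the total number of judged-relevant documents; if
    omitted, it defaults to the relevant documents seen in the ranking."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    denom = num_relevant or sum(rels)
    return total / denom if denom else 0.0

def ndcg_at_k(rels: list[int], k: int = 20) -> float:
    """nDCG@k with the standard log2 rank discount."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```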
Keywords/Search Tags: Information Retrieval, Natural Language Processing, Document Reranking, Pre-training Model