
Research On Large-Scale Text Retrieval Based On Representation Learning

Posted on: 2024-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: S T Xiao
Full Text: PDF
GTID: 2568306944470544
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, digital information has grown explosively, and users face the problem of information overload. Information retrieval plays an important role in this context: it helps people quickly and accurately find the information they need. Retrieval systems have become a key means of relieving information overload and are widely used in today's social media, e-commerce, community forums, and other network applications. Embedding-based retrieval is a popular technique in information retrieval systems. It represents text as vectors and matches a user's query against the documents in a corpus by computing the similarity between the two vectors. However, existing embedding-based retrieval techniques face several challenges: 1) text vector representations have low discrimination and cannot distinguish similar documents, leading to poor retrieval accuracy; 2) large-scale embeddings are memory-unfriendly and require substantial storage overhead; 3) text retrieval efficiency is low, which degrades the user experience.

To address these challenges, this thesis proposes a text retrieval algorithm based on representation learning for large-scale scenarios. For each document, the system generates a dense representation and a compressed quantized representation; the dense representation is stored on disk, and only the quantized representation is loaded into memory, greatly reducing the memory footprint. At the same time, an approximate nearest neighbor index is built over the compressed quantized representations in memory to enable fast search and improve query speed. At query time, the user's search request is transformed into a representation vector by the text representation model. A coarse candidate set is first retrieved through the in-memory approximate nearest neighbor index, and the dense representations of the candidates are then loaded from disk for accurate reranking, yielding the final search result. Within this framework, novel training algorithms are designed for the dense representation, the compressed quantized representation, and the approximate nearest neighbor index:

(1) Search-oriented text representation learning. To address the low accuracy of text representations, a two-stage training method improves the representation model's ability to distinguish similar texts. The first-stage representation is obtained through contrastive learning and is used for global retrieval. To improve the retrieval system's ability to find the correct sample within a set of locally similar nearest neighbors, a hard negative sampling algorithm based on a bipartite graph is proposed, and a second, more accurate representation is optimized from the results of the first stage.

(2) An efficient text representation compression algorithm based on product quantization. To address the memory-unfriendliness of text representations, this thesis proposes a new retrieval-oriented product quantization algorithm. The defects of current general-purpose quantization algorithms and their optimization objectives are first analyzed theoretically and experimentally. To overcome these drawbacks, an end-to-end product quantization algorithm is proposed: product quantization and the text matching model are modeled jointly, and, together with optimized training objectives, an end-to-end joint training framework is constructed. In addition, a cross-device sampling technique with backpropagation is proposed, which greatly increases the number of negative samples in a distributed environment and further guarantees the performance of the compressed vectors.

(3) An approximate nearest neighbor index optimization algorithm for large-scale text retrieval. To address the inability of current indexes to achieve a good balance between accuracy and latency, a learnable index optimization algorithm is proposed that better solves the efficiency problem of text retrieval at scale. The algorithm uses the similarity relationships among the original vectors as a teacher to guide the approximate nearest neighbor index to produce similar results, minimizing the loss introduced by the index and achieving efficient, accurate text queries in large-scale scenarios.
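The two-tier architecture described above (quantized codes in memory for coarse search, dense vectors on disk for reranking) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it stands in for the product-quantized codes with simple int8 scalar quantization, simulates "disk" with an in-memory array, and uses random vectors in place of learned representations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: unit-normalized dense vectors (would live on disk)
# and compact quantized codes (would live in memory).
docs_dense = rng.standard_normal((1000, 64)).astype(np.float32)
docs_dense /= np.linalg.norm(docs_dense, axis=1, keepdims=True)

# Int8 scalar quantization as a stand-in for product-quantized codes:
# 8 bytes per dimension shrink to 1, at some loss of score precision.
scale = np.abs(docs_dense).max()
docs_quant = np.round(docs_dense / scale * 127).astype(np.int8)

def search(query, k_coarse=50, k_final=10):
    """Stage 1: coarse search over in-memory quantized codes.
    Stage 2: exact rerank of the candidates with dense vectors."""
    q = query / np.linalg.norm(query)
    coarse_scores = docs_quant.astype(np.float32) @ q
    candidates = np.argpartition(-coarse_scores, k_coarse)[:k_coarse]
    # In the full system the dense rows would be fetched from disk here.
    exact_scores = docs_dense[candidates] @ q
    order = np.argsort(-exact_scores)[:k_final]
    return candidates[order]

hits = search(rng.standard_normal(64).astype(np.float32))
```

The memory saving comes from keeping only `docs_quant` resident; the accuracy cost of quantization is confined to candidate generation, since the final ranking uses exact dense scores.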
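Contribution (1) trains the representation with contrastive learning. A common objective for this setting, and a plausible basis for the first stage described above, is the InfoNCE loss, which pulls a query toward its relevant document and pushes it away from negatives; hard negatives (such as those the thesis samples via its bipartite graph) would simply populate the `negs` argument. The function below is a hedged sketch of that objective in NumPy, not the thesis's training code.

```python
import numpy as np

def info_nce_loss(q, pos, negs, temperature=0.05):
    """InfoNCE: cross-entropy over cosine similarities, with the
    positive document at index 0 and negatives behind it."""
    q = q / np.linalg.norm(q)
    docs = np.vstack([pos[None, :], negs])
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = docs @ q / temperature
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])        # low when pos is closest to q
```

The loss shrinks as the positive's similarity grows relative to the negatives, so harder (more similar) negatives produce a stronger gradient signal, which is the motivation for the hard negative sampling in the second training stage.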
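Contribution (2) builds on product quantization: a vector is split into subvectors, each subvector is replaced by the index of its nearest codebook centroid, and the stored code is just those indices. The sketch below shows the classic (non-end-to-end) form of the technique with a tiny k-means per subspace; the thesis's contribution is to train such codebooks jointly with the text matching model, which this standalone example does not attempt.

```python
import numpy as np

rng = np.random.default_rng(1)

def pq_train(X, n_sub=4, n_centroids=16, iters=10):
    """Train one small k-means codebook per subspace."""
    d = X.shape[1] // n_sub
    books = []
    for s in range(n_sub):
        sub = X[:, s * d:(s + 1) * d]
        cent = sub[rng.choice(len(sub), n_centroids, replace=False)]
        for _ in range(iters):
            assign = np.argmin(((sub[:, None] - cent[None]) ** 2).sum(-1), axis=1)
            for c in range(n_centroids):
                if (assign == c).any():
                    cent[c] = sub[assign == c].mean(0)
        books.append(cent)
    return books

def pq_encode(X, books):
    """Each vector becomes n_sub centroid indices (1 byte each here)."""
    d = X.shape[1] // len(books)
    codes = [np.argmin(((X[:, s * d:(s + 1) * d][:, None] - b[None]) ** 2).sum(-1), axis=1)
             for s, b in enumerate(books)]
    return np.stack(codes, 1).astype(np.uint8)

def pq_decode(codes, books):
    """Approximate reconstruction by concatenating looked-up centroids."""
    return np.hstack([b[codes[:, s]] for s, b in enumerate(books)])
```

With 4 subspaces and 16 centroids each, a 16-dimensional float64 vector (128 bytes) compresses to 4 bytes of codes, which is the kind of memory reduction the in-memory index relies on.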
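Contribution (3) uses the original vectors as a teacher to guide the index. One natural way to realize "guide the index to output similar results," shown here as an assumed formulation rather than the thesis's exact loss, is a KL-divergence distillation term: the teacher distribution comes from exact query-document similarities, the student distribution from the index's approximate representations, and minimizing the KL pushes the index's ranking toward the exact one.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def distill_loss(q, docs_exact, docs_approx, temperature=1.0):
    """KL(teacher || student): teacher scores from the original dense
    vectors, student scores from the index's approximate vectors."""
    teacher = softmax(docs_exact @ q / temperature)
    student = softmax(docs_approx @ q / temperature)
    return float((teacher * (np.log(teacher) - np.log(student))).sum())
```

The loss is zero exactly when the approximate representations reproduce the teacher's similarity distribution, so minimizing it over a learnable index directly targets the accuracy-versus-latency trade-off the thesis describes.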
Keywords/Search Tags:information retrieval, representation model, vector compression, approximate nearest neighbor searching