Open-domain question answering (ODQA) is an active research topic in natural language processing. Its goal is to extract information relevant to a user’s question from a massive text corpus and ultimately produce an answer to the question. With the continual development of deep learning techniques, particularly recent advances in machine reading comprehension, the "retriever-reader" architecture has become the mainstream approach for open-domain question answering systems. The retriever is responsible for finding a set of candidate passages relevant to the user’s question in the massive data source, while the reader is responsible for understanding these candidate passages and ultimately extracting or generating the corresponding answer. The research presented in this thesis is based on the "retriever-reader" framework and targets massive data sources of text and free-form tables. Specifically, our research focuses on the following:

(1) Proposing a candidate selection method that combines coarse-grained and fine-grained selection for text-based data sources. In the "retriever-reader" framework, the performance of the reader depends heavily on the quality of the candidate passages returned by the retriever. In addition, it is beneficial to explore an integrated question answering solution that blurs the boundary between the retriever and the reader. In the proposed candidate selection method, the coarse-grained selection model selects the Top-K passages most relevant to the question, while the fine-grained model, based on contrastive learning strategies, selects sentences within those passages to remove sentence-level noise and provide better input for the reader. Experiments demonstrate that this method effectively improves the performance of ODQA systems.

(2) Proposing an answer extraction method based on multi-level semantic matching for free-form table data sources. Free-form tables contain not only text-based information but also two types of additional information: titles and row/column headers, which to some extent reflect the overall semantics of the table, and complete records, each corresponding to a whole row or column with clear structural characteristics. In the proposed multi-level semantic matching method, a pseudo-question is generated from the titles and row/column headers, and table-level semantic matching is achieved by matching the pseudo-question against the user’s question. The method then performs semantic matching from coarse to fine, namely at the row/column level and then at the cell level, and ultimately obtains the answer at the cell level. Experiments demonstrate that this method effectively achieves answer extraction for free-form tables.

(3) Proposing a method that combines a retriever based on aligned blocks with a cross-block reader based on a long-sequence Transformer model for mixed text and free-form table data sources. First, multiple text fragments are aligned with a table using entity linking to form aligned blocks that combine the table and the text fragments. Second, retrieval is performed at the level of aligned blocks to obtain the Top-K aligned blocks. Finally, global and local sparse attention mechanisms are combined, and a long-sequence Transformer model is used to perform cross-block reading and extract answers. Experiments demonstrate that this method better exploits the complementarity between tables and text to improve the performance of question answering systems.
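As a rough illustration of the coarse-to-fine candidate selection described in (1), the control flow can be sketched as follows. This is a minimal sketch under stated assumptions, not the method proposed in the thesis: the learned coarse-grained and contrastive fine-grained models are replaced by a simple token-overlap score, and all function names, the `top_k` parameter, and the `threshold` parameter are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, text):
    """Stand-in relevance score: fraction of question tokens found in text.

    Assumption: the thesis uses learned models here; token overlap is only
    a placeholder so that the two-stage flow can be run end to end.
    """
    q_tokens = set(tokenize(question))
    if not q_tokens:
        return 0.0
    return len(q_tokens & set(tokenize(text))) / len(q_tokens)


def select_passages(question, passages, top_k):
    """Coarse-grained step: keep the top_k highest-scoring passages."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:top_k]


def select_sentences(question, passage, threshold):
    """Fine-grained step: drop sentences that score below the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    return [s for s in sentences if overlap_score(question, s) >= threshold]


def build_reader_input(question, passages, top_k=2, threshold=0.3):
    """Run both steps and return the filtered sentences passed to the reader."""
    selected = select_passages(question, passages, top_k)
    return [s for p in selected for s in select_sentences(question, p, threshold)]
```

The point of the sketch is the division of labour: the first step narrows the corpus to a few passages, and the second step removes low-relevance sentences inside those passages before anything reaches the reader.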