Open-domain question answering is the task of answering general-domain questions posed in natural language. It is one of the core problems in information retrieval and natural language processing. Most existing studies divide the problem into several stages, including document retrieval, document ranking, and machine reading comprehension. Document retrieval aims to retrieve documents relevant to the question from a large text corpus. Document ranking re-ranks the documents returned by the retrieval stage. Machine reading comprehension extracts the final answer from the re-ranked documents. In recent years, self-attention based pretrained models have been widely used in open-domain question answering, but they bring high computational and memory costs. This paper applies hash learning to the different stages of open-domain question answering. Three main contributions are outlined below.

First, existing information retrieval methods mostly use the TF-IDF or BM25 algorithm. These algorithms rely on direct keyword matching and cannot capture semantic information. To address this problem, this paper proposes a hashing-based query expansion model (HQE), which rewrites the question and improves the efficiency of query expansion via hash learning. Experiments show that the HQE model obtains higher recall than existing approaches on multiple datasets.

Second, document ranking models that use pretrained self-attention networks as their encoders suffer from computational efficiency and memory cost issues. We propose a hashing-based passage re-ranking (HPR) model, which learns a binary matrix representation of each candidate document. During online prediction, the model stores these binary matrices in memory to avoid recomputation, which also reduces the memory cost. Experiments on three datasets show that HPR outperforms existing models and achieves state-of-the-art performance.

Third, existing reading comprehension models mostly use pretrained self-attention models to obtain contextual semantic representations of documents and questions, which likewise incur high computational and memory costs. Considering other candidate documents while predicting a document's answer can improve performance, but it greatly increases the memory cost. To tackle this problem, this paper presents a hashing-based multi-document reading comprehension model (HMRC), which predicts the answer over multiple iterations and learns binary representations of the candidate documents to reduce the memory cost. Experiments on three open-domain QA datasets show that our model achieves state-of-the-art performance.
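
To make the shared idea behind these contributions concrete, the sketch below illustrates one common form of hash learning for retrieval: dense encoder outputs are mapped to binary codes by a projection followed by a sign threshold, the codes for all candidate documents are cached once, and online scoring compares bits via Hamming distance. All names, dimensions, and the random projection here are illustrative placeholders, not the actual HQE, HPR, or HMRC architectures described in this paper.

```python
import numpy as np

# Hypothetical dimensions; the encoder and code length used by the
# paper's models are not specified here.
DENSE_DIM = 768   # e.g. a pretrained self-attention encoder's hidden size
CODE_BITS = 256   # length of the binary code

rng = np.random.default_rng(0)

# Stand-in for a learned hashing projection. In hash learning this matrix
# would be trained jointly with the encoder (e.g. via a tanh relaxation
# of the sign function); a random matrix is used purely for illustration.
W = rng.standard_normal((DENSE_DIM, CODE_BITS))

def to_binary_code(dense_vec: np.ndarray) -> np.ndarray:
    """Map a dense representation to a {0, 1} code via sign(W^T x)."""
    return (dense_vec @ W > 0).astype(np.uint8)

def hamming_distance(code_a: np.ndarray, code_b: np.ndarray) -> int:
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(code_a != code_b))

# Pre-compute and cache binary codes for all candidate documents once,
# so online prediction compares bits instead of re-encoding text.
doc_dense = rng.standard_normal((1000, DENSE_DIM))  # placeholder encoder outputs
doc_codes = np.stack([to_binary_code(d) for d in doc_dense])

# At query time: hash the question representation and rank documents by
# Hamming distance (smaller distance = more similar).
query_dense = rng.standard_normal(DENSE_DIM)
query_code = to_binary_code(query_dense)
ranking = np.argsort([hamming_distance(query_code, c) for c in doc_codes])
print("top-5 candidate documents:", ranking[:5])
```

The appeal of this setup is that the cached codes are bit arrays rather than floating-point matrices, so both the per-document storage and the per-comparison cost drop sharply compared with re-running a self-attention encoder for every candidate.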