Research On Question Deep Semantic Matching For Community Question Answering

Posted on: 2021-04-06 | Degree: Master | Type: Thesis
Country: China | Candidate: X Chen
GTID: 2428330605474885 | Subject: Computer technology
Abstract/Summary:
Question matching, also known as similar question identification, is an important task in community question answering (CQA). It makes effective use of the existing question-and-answer pairs in a CQA system to improve the user experience. In CQA, similar question identification aims to find questions in the question-and-answer dataset {q1, q2, …, qn} that are semantically similar to the user's question q0, and to return the corresponding answers to the user. It is usually divided into two stages: recall and reordering. First, for the sake of timeliness, the CQA system uses retrieval or similar methods to recall the top k questions most similar to the user's question from the large dataset. Second, the system applies a binary question matching model to each of the k recalled questions (i.e., paraphrase identification) and reorders them. In this process, question matching in CQA faces three problems:

· In the reordering stage, state-of-the-art binary question matching models are highly complex and difficult to train, which leads to low timeliness.
· Retrieval-based recall is not accurate enough, which causes error propagation.
· CQA includes cross-lingual similar question identification scenarios, but the field lacks a cross-lingual similar question identification corpus.

To address these three problems, this thesis builds on the basic structure of question matching, uses deep neural network models to semantically encode questions, and carries out the following three pieces of research:

(1) Similar Question Identification based on Multi-Convolution Self-Interaction Matching

Existing approaches typically treat similar question identification as a paraphrase identification task. They often build complex network models to encode deep semantic representations of natural-language questions, which makes the models highly complex, difficult to train, and slow to run. To solve this problem, this thesis proposes a lightweight multi-convolution self-interaction matching method. The approach obtains rich word-level semantic representations by fusing different sentence features with lexical features. It then uses convolutional neural networks to capture phrase-level semantic representations, and fuses the word-level and phrase-level representations through a multi-convolution self-interaction fusion method to obtain multi-grained sentence semantic information. Experiments are conducted on the Quora corpus. The results show that the proposed method performs comparably to the benchmark model while requiring less memory and training faster. Specifically, the physical memory required for training is reduced by 80% compared with the benchmark model, and the training iteration speed is 19 times faster.

(2) Semantic Space Distance Method for Similar Question Identification

Existing research typically identifies "one-to-one" similarity between two natural-language questions, which differs from the actual CQA application scenario of "one-to-many" recall of similar questions. For the sake of overall timeliness, CQA systems usually recall similar questions quickly with retrieval methods. The recalled data are not accurate enough, which causes error propagation. In response, and inspired by the face identification task, this thesis proposes a similar question identification method based on semantic space distance. During training, the method treats similar question identification as a multi-class classification task, and a semantic encoder is obtained as a result. At test time, the semantic encoder maps all questions to vector representations in the same semantic space, and similar questions are identified by the distance between the vectors. The effectiveness of the method is verified experimentally on a corpus constructed from competition data from Biendata. The method outperforms the baseline method by 5% on multiple performance metrics.

(3) Automatically Constructing Cross-lingual Similar Question Corpus Method based on Web Data

CQA includes cross-lingual similar question identification scenarios. These scenarios require a cross-lingual similar question identification corpus to drive related research, but no cross-lingual corpus built specifically for similar question identification is available. In response, this thesis proposes a method for automatically constructing a cross-lingual similar question corpus from web data. The method crawls user questions from Baidu Zhidao, a large Chinese question-answering community, filters out low-quality data with rules and language models, and uses neural machine translation models to obtain the corresponding English questions. Finally, it constructs a large-scale cross-lingual corpus of similar Chinese and English questions from the paired Chinese and English question data. Several cross-lingual Chinese-English similar question identification methods were evaluated on the constructed corpus. The XLM benchmark model achieved an accuracy of 90.45% on it, which helps advance research on cross-lingual similar question identification.
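
The test-time step of the semantic space distance method in part (2), mapping every question into one vector space and ranking stored questions by their distance to the user's question, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the distance measure, so cosine distance is used here, and the `encode` function is a toy bag-of-words stand-in for the trained semantic encoder, not the thesis's model. The function names and sample questions are hypothetical.

```python
import math
from collections import Counter


def encode(question, vocab):
    """Toy stand-in for a trained semantic encoder (assumption):
    maps a question to a bag-of-words count vector over `vocab`."""
    counts = Counter(question.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine_distance(u, v):
    """Cosine distance between two vectors; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def recall_top_k(user_question, stored_questions, k=2):
    """Rank stored questions by distance to the user's question and
    return the k closest as (distance, question) pairs."""
    # Shared vocabulary so that all vectors live in the same space.
    vocab = sorted(
        {w for q in stored_questions + [user_question] for w in q.lower().split()}
    )
    query_vec = encode(user_question, vocab)
    scored = [
        (cosine_distance(query_vec, encode(q, vocab)), q) for q in stored_questions
    ]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Hypothetical sample data.
stored = [
    "how do i reset my password",
    "what is the best way to learn python",
    "how can i reset my account password",
]
top = recall_top_k("how to reset my password", stored, k=2)
for distance, question in top:
    print(f"{distance:.3f}  {question}")
```

In a real system the stored questions would be encoded once and indexed, so that only the user's question needs to be encoded at query time.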
Keywords/Search Tags: Community Question Answering, Paraphrase Identification, Similar Question Matching, Multi-convolution Interaction, Semantic Encoder