
Research on Deep Learning Based Multi-Modal Question Answering

Posted on: 2021-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: A Liu
Full Text: PDF
GTID: 2428330623967813
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, deep learning has driven rapid progress in natural language processing (NLP) and computer vision (CV). One very active topic in NLP is question answering, which requires machines to automatically answer questions posed in natural language. Beyond traditional text-only question answering, there is a variant that supplies multimodal data (such as text and images): the multimodal question answering task. This task raises new challenges in how to combine multimodal data and apply the corresponding inference strategies.

This thesis studies a branch of multimodal question answering called Multi-Modal Machine Reading Comprehension (MMMC), a multimodal extension of machine reading comprehension (MRC). MRC requires reading and understanding a passage and answering questions about its content. In MMMC, however, the reading context takes a multimodal form, such as a text with accompanying pictures, and the questions are not limited to text but may also be composed of images. MMMC covers various problem types, such as cloze filling, multiple choice, and ordering. The most recent MMMC dataset is RecipeQA, which defines four different MMMC tasks. In this thesis, we conduct in-depth research on MMMC and propose novel deep learning models that can handle multiple task styles. We run experiments on all four RecipeQA tasks and achieve state-of-the-art results.

Because previous MRC work did not exploit temporal order information, we first propose an order-oriented deep learning model that processes temporal order information in unimodal MRC. We reconstruct the textual cloze task of RecipeQA and extend it into an activity-ordering task, which requires a series of activity phrases to be sorted according to the context passage. For this task we propose the OrdMatch model, which has two main modules: a hierarchical matching module and an attention-based ordering regularization term (see the first sketch below). Experimental results show that the model effectively learns temporal order information in MRC and improves text matching.

In addition, we explore the different task forms of MMMC, specifically on the RecipeQA dataset. After close investigation, we found no state-of-the-art model designed around the task forms of RecipeQA. We focus on two of them: machine reading comprehension with a multimodal context, where the context itself is multimodal, and machine reading comprehension with a multimodal question, where the question and the context are in different modalities. For the multimodal-context setting, we propose a multi-modal neural tensor network (MM-NTN) built on the neural tensor network, which scores the triplet correlation of <document, images, answer> (see the scoring sketch below). This model achieves better results than the OrdMatch model described above. For the multimodal-question setting, i.e., a textual context with visual questions (and answers), we propose a Multi-Level Multi-Modal Transformer (MLMM-Trans) architecture based on the multi-head attention mechanism. It extracts features separately at the step level and at the document-image level (see the fusion sketch below); its key contribution is a general architecture for the multimodal fusion of multiple sentences and multiple images. The model obtains state-of-the-art results on multiple tasks, demonstrating its effectiveness on MMMC.
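The abstract does not give the exact form of OrdMatch's attention-based ordering regularization, so the following is only a minimal sketch of one plausible formulation: each candidate activity phrase attends over the context steps, its expected step index is computed from the attention weights, and out-of-order pairs of expected indices are penalized with a hinge. The function name `ordering_regularizer` and the monotonicity assumption are illustrative, not the thesis's definition.

```python
import torch
import torch.nn.functional as F

def ordering_regularizer(att: torch.Tensor) -> torch.Tensor:
    """Hypothetical ordering penalty on attention maps.

    att: (num_phrases, num_steps) attention of each candidate activity
         phrase over the context steps; rows are assumed to sum to 1,
         and row i is assumed to correspond to gold position i.
    """
    num_phrases, num_steps = att.shape
    positions = torch.arange(num_steps, dtype=att.dtype)
    expected = att @ positions                             # expected step index per phrase
    diffs = expected.unsqueeze(1) - expected.unsqueeze(0)  # diffs[i, j] = e_i - e_j
    mask = torch.triu(torch.ones_like(diffs), diagonal=1)  # pairs with i < j
    # hinge: penalize e_i > e_j when phrase i should precede phrase j
    return (F.relu(diffs) * mask).sum() / mask.sum()
```

In training, such a term would presumably be added to the matching objective, e.g. `loss = match_loss + lam * ordering_regularizer(att)`; the weighting is likewise an assumption.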
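The neural tensor network on which MM-NTN is built follows the standard bilinear form score(e1, e2) = u^T tanh(e1^T W^[1:k] e2 + V[e1; e2] + b) (Socher et al., 2013). Below is a minimal PyTorch sketch of such a scorer applied to a <document, images, answer> triplet; the mean-pooling fusion of document and image features, and all names and shapes, are assumptions, since the abstract does not specify how MM-NTN combines the modalities.

```python
import torch
import torch.nn as nn

class NeuralTensorScorer(nn.Module):
    """Minimal neural tensor network scorer over a fused context vector.

    The fusion of document and image features into one context vector is
    a stand-in; the thesis's MM-NTN may fuse modalities differently.
    """
    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, dim, dim) * 0.01)  # k bilinear slices
        self.V = nn.Linear(2 * dim, k)                          # linear part V[c; a] + b
        self.u = nn.Linear(k, 1, bias=False)                    # output weights u

    def forward(self, doc: torch.Tensor, imgs: torch.Tensor, ans: torch.Tensor):
        # doc: (dim,) document feature; imgs: (n_img, dim); ans: (dim,)
        ctx = (doc + imgs.mean(dim=0)) / 2                       # hypothetical fusion
        bilinear = torch.einsum('i,kij,j->k', ctx, self.W, ans)  # c^T W_k a per slice
        linear = self.V(torch.cat([ctx, ans]))                   # V [c; a] + b
        return self.u(torch.tanh(bilinear + linear))             # u^T tanh(...)
```

A higher score would indicate a more plausible candidate answer for the given document and images, e.g. `NeuralTensorScorer(dim=128)(torch.randn(128), torch.randn(5, 128), torch.randn(128))`.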
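MLMM-Trans is described only at the level of "step-level and document-image-level feature extraction with multi-head attention". The sketch below shows one way such two-level fusion can be wired from stock PyTorch modules: a transformer layer encodes the tokens within each step, and a cross-attention layer then lets the question images attend over the resulting step vectors. All module choices, shapes, and the pooling step are assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class TwoLevelFusion(nn.Module):
    """Sketch of two-level multi-head attention fusion, assuming each
    step is a sequence of token vectors and each question image is a
    single feature vector."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.step_enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, steps: torch.Tensor, images: torch.Tensor):
        # steps:  (n_steps, n_tokens, dim) token features per step
        # images: (n_images, dim) one feature vector per question image
        step_vecs = self.step_enc(steps).mean(dim=1)  # step level: (n_steps, dim)
        doc = step_vecs.unsqueeze(0)                  # (1, n_steps, dim)
        q = images.unsqueeze(0)                       # (1, n_images, dim)
        fused, _ = self.cross(q, doc, doc)            # document-image level
        return fused.squeeze(0)                       # (n_images, dim)
```

The fused image representations could then feed a matching head over the candidate answers; that final scoring step is left out here because the abstract does not describe it.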
Keywords/Search Tags: multi-modal machine reading comprehension (MMMC), machine reading comprehension (MRC), attention mechanism