
Research On Generative Question Answering System Based On Multimodal Information Fusion

Posted on: 2021-02-13  Degree: Master  Type: Thesis
Country: China  Candidate: W X Liao  Full Text: PDF
GTID: 2428330611467593  Subject: Computer technology
Abstract/Summary:
The artificial intelligence boom driven by deep learning has inspired researchers to apply deep learning to question answering. As an important mode of human-computer interaction, a question answering system enables machines to communicate with people in natural language. Information in the real world usually spans multiple modalities, such as video, audio, and text, yet most previous research on question answering targets a single modality: structured data, text, or images. A question answering system based on a single modality struggles to integrate information from diverse sources and easily deviates from realistic question answering scenarios, both in understanding questions and in generating natural language. Building a multimodal question answering model that can process and correlate features across modalities therefore aids the interpretation of, and reasoning over, multimodal information.

The main challenge of a multimodal question answering system lies in modeling the interaction between each modality's features and the question. Because of the semantic gap between the different modalities and the question, a generic Seq2seq model is ill-suited to generating natural-language replies. This thesis proposes a Multimodal Attention mechanism based Question Answering system (Mm_Att_QA). Mm_Att_QA comprises three modules: an Encoder, a scene-description module, and a Decoder.

(1) Encoder module: extracts and encodes features of the video, the audio, the historical interaction records, and the current question. Video features are extracted with a transferred I3D model; audio features with a transferred VGGish model; the historical interaction records and the current question are encoded with word2vec embeddings and a bidirectional LSTM.

(2) Scene-description module: to weaken the impact of the semantic gap between video, audio, and text on the performance of the question answering system, this module generates a textual scene description from the audio and video features via supervised learning, which helps fuse the video and audio features when generating the final reply.

(3) Decoder module: fuses the various modality features through a multimodal attention mechanism, conditioned on the current question, to generate a reply. To locate question-relevant features, each modality's features are first associated with the current question; then, when generating each word of the reply, the multimodal attention mechanism selects the most strongly associated features within each modality. To balance the scene-description task against the reply-generation task, this thesis proposes a composite loss function.

Experimental results show that the proposed generative question answering system based on multimodal information fusion outperforms the benchmark models on multiple evaluation metrics; the factors influencing the model are also discussed and analyzed in detail.
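The decoder's fusion step can be illustrated as attention applied first within each modality (over time steps) and then across modalities. The sketch below is a minimal NumPy illustration under assumptions, not the thesis's exact formulation: the dot-product scoring, the two-level weighting scheme, and all function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, feats):
    """Dot-product attention within one modality: weight each time step of
    `feats` (T, d) by its similarity to `query` (d,), return the weighted
    sum as a context vector (d,)."""
    scores = feats @ query           # (T,) similarity of each step to query
    weights = softmax(scores)        # attention distribution over steps
    return weights @ feats           # (d,) context vector

def multimodal_attention(query, modality_feats):
    """Two-level fusion (an assumed scheme): attention within each modality,
    then a second attention over the per-modality context vectors."""
    contexts = np.stack([attend(query, f) for f in modality_feats])  # (M, d)
    modality_weights = softmax(contexts @ query)                     # (M,)
    return modality_weights @ contexts                               # (d,)

# Toy features standing in for the encoder outputs described above.
rng = np.random.default_rng(0)
d = 8
query = rng.standard_normal(d)           # decoder state for the current word
video = rng.standard_normal((12, d))     # e.g. I3D clip features
audio = rng.standard_normal((20, d))     # e.g. VGGish frame features
history = rng.standard_normal((5, d))    # encoded interaction history
fused = multimodal_attention(query, [video, audio, history])
print(fused.shape)  # (8,)
```

In a full model the fused context would be concatenated with the previous word embedding and fed to the decoder LSTM at each generation step.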
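The abstract does not give the exact form of the composite loss; a common formulation for such multi-task training is a weighted sum of the two tasks' cross-entropies. The weight `lam` and the function names below are assumptions for illustration, not the author's notation.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of each step's target token.
    `probs` is (N, V) predicted distributions, `targets` is (N,) indices."""
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked)))

def composite_loss(reply_probs, reply_targets,
                   scene_probs, scene_targets, lam=0.5):
    """Reply-generation loss plus a weighted auxiliary scene-description
    loss; `lam` balances the two tasks (its value is an assumption here)."""
    return (cross_entropy(reply_probs, reply_targets)
            + lam * cross_entropy(scene_probs, scene_targets))

# Toy check with uniform predictions over a 4-word vocabulary.
uniform = np.full((3, 4), 0.25)
targets = np.array([0, 1, 2])
loss = composite_loss(uniform, targets, uniform, targets, lam=0.5)
print(round(loss, 4))  # 1.5 * ln(4) ≈ 2.0794
```

With uniform predictions each term equals ln(4), so the composite value is (1 + lam) · ln(4), which makes the balancing role of `lam` easy to see.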
Keywords/Search Tags: multimodality, attention mechanism, question answering system, natural language generation