
Research On Multi-mode Question Answering Method

Posted on: 2023-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Bai
Full Text: PDF
GTID: 2558307163489684
Subject: Computer technology
Abstract/Summary:
Traditional question answering systems, such as XiaoAI, Siri, and customer-service chatbots, are designed for single-modality scenarios: their questions and answers contain only text or audio information. These systems have two main limitations: 1) they cannot answer multimodal questions posed by users, such as questions about a table or an image; 2) they cannot return multimodal answers, that is, answers that combine text and pictures. To address these two problems, this thesis designs a visual question answering model based on neural cellular automata and a multimodal question answering model based on reading comprehension. The specific research contents are as follows.

(1) For the scenario where users ask questions about images, this thesis proposes a visual question answering model based on neural cellular automata, which gives users accurate answers by jointly understanding visual and textual modality information. The model constructs a question encoding module that converts the question sentence into a text semantic vector, and an image encoding module that extracts the visual objects in the user's input image and converts them into visual semantic vectors. To solve the problem of cross-modal alignment between vision and text, the text and visual semantic vectors are organized into a multimodal cell graph, and a multimodal fusion vector is generated using a custom cellular "birth and death rule". For answer prediction, a classification layer serves as the decoder, selecting the best match from the candidate answers. Experiments on a visual question answering dataset show that this method improves both accuracy and interpretability compared with existing neural network methods based on CNNs or RNNs.

(2) For the scenario where users require answers containing both text and images, this thesis proposes a multimodal question answering model based on reading comprehension, which provides users with combined text-and-image answers on the basis of understanding the semantics of the question. The model designs two methods for generating the text answer. The first, "selective text answering", converts the question and paragraph into a joint embedding vector and predicts the text answer with a sequence labeling model. The second, "generative text answering", trains in two stages, unsupervised pre-training on a paragraph corpus followed by fine-tuning on question-answer data, and predicts the text answer with a generative model. For generating the image answer, the question and the text answer are converted into a joint embedding vector, and the most suitable image answer is matched among the candidate images by contrastive learning. Experiments on multimodal reading comprehension datasets show that this model can generate answers combining text and images for users, and improves accuracy and answer richness compared with existing methods.
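The cell-graph fusion described in (1) can be illustrated with a minimal sketch. This is a hypothetical simplification, not the thesis's actual model: the update weights here are random rather than learned, the cell graph is fully connected, and the "birth and death rule" is approximated by an activation-norm threshold that decides which cells stay alive.

```python
import numpy as np

rng = np.random.default_rng(0)

def nca_fuse(text_vec, visual_vecs, steps=4, alive_thresh=0.1):
    """Fuse a text semantic vector with visual object vectors on a
    multi-modal cell graph (illustrative sketch only).

    Each cell holds one modality vector; at every step a cell is
    updated from the mean message of the living cells, and a simple
    "birth and death rule" keeps a cell alive only while its
    activation norm stays above alive_thresh.
    """
    cells = np.vstack([text_vec] + list(visual_vecs)).astype(float)
    alive = np.ones(len(cells), dtype=bool)
    # Hypothetical update weights; in the real model these are learned.
    W = rng.normal(scale=0.1, size=(cells.shape[1], cells.shape[1]))
    for _ in range(steps):
        neigh = cells[alive].mean(axis=0)        # message from living cells
        update = np.tanh((cells + neigh) @ W)    # per-cell update rule
        cells = np.where(alive[:, None], cells + update, cells)
        alive = np.linalg.norm(cells, axis=1) > alive_thresh  # death rule
    return cells[alive].mean(axis=0)             # fused multimodal vector

# Toy usage: one 8-dim text vector and three 8-dim visual object vectors.
fused = nca_fuse(rng.normal(size=8), rng.normal(size=(3, 8)))
```

The fused vector would then be fed to a classification layer over the candidate answers, as the abstract describes.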
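The "selective text answer" method in (2) predicts a span of the paragraph with a sequence labeling model. A minimal sketch of the decoding side, assuming a standard BIO tag set (the tags would come from the trained tagger, not be hand-written as here):

```python
def extract_span(tokens, tags):
    """Recover the text answer from BIO tags produced by a sequence
    labeling model (sketch; the tagger itself is a learned model)."""
    answer = []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            answer = [tok]          # start a new answer span
        elif tag == "I" and answer:
            answer.append(tok)      # extend the current span
    return " ".join(answer)

# Toy usage with hand-written tags standing in for model output.
tokens = ["the", "capital", "is", "Paris", "."]
tags   = ["O",   "O",       "O",  "B",     "O"]
# extract_span(tokens, tags) == "Paris"
```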
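The image-answer step in (2) retrieves the candidate image whose embedding lies closest to the joint question-plus-text-answer embedding, in the style of contrastive retrieval. A sketch under the assumption of cosine similarity over already-computed embeddings (the encoders themselves are learned networks not shown here):

```python
import numpy as np

def match_image_answer(query_vec, candidate_vecs):
    """Pick the candidate image embedding with the highest cosine
    similarity to the joint question/text-answer embedding
    (illustrative sketch of the retrieval step only)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarities
    return int(np.argmax(scores)), scores

# Toy usage: candidate 1 exactly matches the query embedding.
query = np.array([1.0, 0.0, 0.0])
cands = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
best, scores = match_image_answer(query, cands)
# best == 1
```

During training, a contrastive objective would push matched (query, image) pairs together and mismatched pairs apart in this embedding space.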
Keywords/Search Tags: Multimodal, Image Text Matching, Cellular Automata, Graph Neural Networks