
Multimodal Natural Language Generation For Human-computer Interaction

Posted on: 2020-04-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: F Liu    Full Text: PDF
GTID: 1368330626950340    Subject: Control theory and control engineering
Abstract/Summary:
Multimodal natural language generation has become one of the most popular research areas at the intersection of computer vision and natural language processing, with a wide range of applications in human-machine interaction and intelligent robotics. In this study, we focus on developing algorithms that enable an agent to visually sense the world and perform vision tasks specified by a human, taking natural language instructions and generating corresponding textual responses. To be practical, the agent should be able to perform three tasks: image annotation, which describes an image via keywords or natural language; visual question answering (VQA), which answers visual questions asked by a human; and visual question generation (VQG), which asks visual questions for a human to answer.

For all three tasks, the CNN-RNN hybrid network is one of the most fundamental approaches: the CNN encodes the visual content, and the RNN encodes and decodes the textual content. From a training perspective, an RNN can be viewed as a very deep neural network, so the vanishing gradient problem becomes a main issue when training such hybrid networks, making the whole training process unstable and slow. Even though strategies such as stage-wise training have been proposed, there is still no effective solution to this problem. The other problem in visuo-linguistic understanding is language bias, where a model overly exploits the correlation between the textual input and the output and thus makes predictions without fully understanding the image content. Recent studies show that most VQA systems suffer from this effect even though they achieve high scores on the benchmark datasets. The bias stems from the large domain gap between images and natural language, which makes it hard for VQA agents to exploit visual information. It also makes model design difficult, because better designs may not achieve higher scores owing to these training and evaluation issues.

In this study, we therefore first address the training difficulty of CNN-RNN hybrid networks with semantic regularization, as a foundation for the subsequent research; we then attack inverse visual question answering (iVQA) as a less biased visuo-linguistic problem; finally, we use the iVQA model for belief-set diagnosis of VQA models. More specifically, our contributions can be summarized as follows:

(1) We propose a semantically regularized CNN-RNN framework for image annotation. Most CNN-RNN approaches use CNN hidden layers as the interface between the CNN and the RNN, and these layers carry no explicit semantic meaning. As a result, the RNN is overstretched by shouldering two tasks: semantic concept prediction and concept relation modeling. Meanwhile, the sequence prediction loss at the end of the RNN becomes the only source of supervision, and its gradient has to pass through the whole RNN to reach the CNN, so end-to-end training of the whole network is problematic and can even degrade the CNN features. We introduce a regularization term that forces the hidden units of the interface layer to represent predefined semantic concepts. This decouples concept prediction from relation modeling, with the former now handled by the CNN and the latter by the RNN, and it introduces auxiliary supervision in the middle of the network, providing a strong gradient to guide the training of the CNN. Extensive experiments on both keyword-based and natural language description demonstrate the effectiveness of the proposed method.
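As an illustration of contribution (1), the sketch below shows how an auxiliary concept-prediction loss at the CNN-RNN interface can be added to the usual sequence loss. It is a minimal PyTorch-style sketch under assumed interfaces (a generic `cnn` backbone exposing an `out_dim` attribute, multi-hot `concept_targets`); it is not the exact architecture from the dissertation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticCnnRnn(nn.Module):
    """Sketch of a semantically regularized CNN-RNN captioner.

    The CNN's interface layer is tied to a fixed concept vocabulary, so an
    auxiliary concept loss supervises the CNN directly instead of relying
    only on the gradient that flows back through the whole RNN.
    """
    def __init__(self, cnn, num_concepts, vocab_size, hidden_size=512):
        super().__init__()
        self.cnn = cnn                                             # any backbone producing a feature vector
        self.concept_head = nn.Linear(cnn.out_dim, num_concepts)   # interface layer = concept scores
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.init_h = nn.Linear(num_concepts, hidden_size)
        self.word_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, captions):
        concept_logits = self.concept_head(self.cnn(images))              # (B, C)
        h0 = torch.tanh(self.init_h(torch.sigmoid(concept_logits)))       # RNN sees concepts, not raw features
        state = (h0.unsqueeze(0), torch.zeros_like(h0).unsqueeze(0))
        out, _ = self.rnn(self.embed(captions[:, :-1]), state)
        word_logits = self.word_head(out)                                  # (B, T-1, V)
        return concept_logits, word_logits

def loss_fn(concept_logits, word_logits, concept_targets, captions, alpha=1.0):
    # auxiliary multi-label concept loss (the semantic regularizer) + usual sequence loss
    concept_loss = F.binary_cross_entropy_with_logits(concept_logits, concept_targets)
    seq_loss = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                               captions[:, 1:].reshape(-1))
    return seq_loss + alpha * concept_loss
```

Because the concept loss acts directly on the interface layer, the CNN receives a strong training signal even when the gradient from the sequence loss is weak.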
(2) We propose the problem of inverse visual question answering (iVQA), a new setting for question generation. Unlike conventional VQG, iVQA aims to generate questions conditioned on both an image and a keyword. Taking a certain belief about an image as the keyword condition, the agent can generate a corresponding query question, and by taking in the human response it can refine its original belief, which makes iVQA more practical than VQG. In our preliminary study, we use the answers in the VQA dataset as keywords and generate the questions that lead to those answers. This setting resembles VQA in that both take visual and textual inputs and produce textual outputs. However, iVQA is a less biased problem: the information in the keyword is very limited compared with a full question, so the agent must turn to the image for cues to generate a correct question, which requires a deeper understanding of the image and makes iVQA more suitable as a visuo-linguistic benchmark. We propose an iVQA model based on multi-modal attention, in which the model dynamically shifts its fixation on the image during decoding to provide better visual features (a schematic sketch of one attention step is given after this abstract). The mapping from image and keyword to question is one-to-many, which makes standard linguistic metrics inaccurate under limited annotation, so we also propose a ranking-based evaluation metric that correlates well with human-study scores. Extensive experiments demonstrate the effectiveness of the proposed model as well as the accuracy of the new metric.

(3) We propose an iVQA model based on conditional variational auto-encoders, which improves training efficacy and enables diverse question generation (a simplified sketch follows this abstract). The variational iVQA model consists of two parts: an encoder that maps the question to a distribution over latent vectors, and a decoder that takes the latent vector, image, and keyword as inputs to reconstruct the question. Because the style and topic are captured by the latent vector, question generation during training becomes a one-to-one mapping, making training less ambiguous and more efficient. At test time, latent vectors can be sampled from the prior distribution, from which the decoder generates diverse questions. We evaluate the proposed model on the VQA dataset, and the results show that it can generate questions with varying topics while remaining strictly conditioned on the image and keyword.

(4) We propose a belief-set based diagnosis and evaluation method for VQA models. The belief set of a VQA model is defined as the set of image-question-answer tuples that the model believes to hold. By examining the examples in the belief set, the pros and cons of different VQA models can be compared. Unlike conventional approaches to VQA analysis, the proposed belief-set method can exploit new questions that satisfy the beliefs of the VQA model, which is made possible by our study of iVQA models. To construct the belief set effectively, we employ a reinforcement-learning paradigm for iVQA training, aiming to maximize the score of the VQA model by generating appropriate questions (a policy-gradient sketch follows this abstract). We validate our approach through extensive experiments and perform studies on several leading VQA models. Our analysis shows that current VQA models hold more wrong beliefs than expected, and that reducing language bias should be a main direction of future study.
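For contribution (2), the following is a minimal sketch of one keyword-guided attention step, assuming precomputed region features and a keyword (answer) embedding; module and argument names are hypothetical and the dissertation's model may differ.

```python
import torch
import torch.nn as nn

class KeywordGuidedAttention(nn.Module):
    """One attention step for an iVQA decoder (illustrative, not the exact model).

    The query mixes the decoder's hidden state with the keyword embedding, so
    the fixation over image regions can change at every decoding step as the
    partially generated question evolves.
    """
    def __init__(self, feat_dim, hidden_dim, key_dim, att_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, att_dim)
        self.proj_query = nn.Linear(hidden_dim + key_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, region_feats, hidden, keyword_emb):
        # region_feats: (B, R, feat_dim), hidden: (B, hidden_dim), keyword_emb: (B, key_dim)
        query = self.proj_query(torch.cat([hidden, keyword_emb], dim=-1))                   # (B, att_dim)
        scores = self.score(torch.tanh(self.proj_feat(region_feats) + query.unsqueeze(1)))  # (B, R, 1)
        weights = torch.softmax(scores, dim=1)                                               # attention over regions
        attended = (weights * region_feats).sum(dim=1)                                       # (B, feat_dim)
        return attended, weights.squeeze(-1)
```

Calling this module once per decoding step is what allows the model to shift its fixation on the image while the question is being generated.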
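For contribution (3), a simplified conditional-VAE sketch is given below. The GRU encoder, the latent dimensionality, and the conditioning vector `cond` (a joint image-keyword embedding) are assumptions made for illustration; only the overall encode-sample-decode structure follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalIVQA(nn.Module):
    """Sketch of a conditional VAE for question generation (simplified).

    The encoder compresses the ground-truth question into a latent code z;
    the decoder reconstructs the question from (z, image, keyword). At test
    time z is sampled from the standard normal prior to produce diverse
    questions for the same image/keyword pair.
    """
    def __init__(self, vocab_size, cond_dim, hidden=512, z_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.q_encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.init_dec = nn.Linear(z_dim + cond_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, question, cond):
        # question: (B, T) token ids; cond: joint image+keyword embedding, (B, cond_dim)
        _, h = self.q_encoder(self.embed(question))                    # h: (1, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)        # reparameterization trick
        h0 = torch.tanh(self.init_dec(torch.cat([z, cond], dim=-1))).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(question[:, :-1]), h0)
        logits = self.out(dec_out)
        recon = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                question[:, 1:].reshape(-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                                               # negative ELBO, minimized in training
```

Because z absorbs the style and topic of the target question, the decoder's training target becomes unambiguous, while sampling z from the prior at test time yields diverse questions.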
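For contribution (4), the sketch below shows one REINFORCE-style update in which the frozen VQA model's score for the target answer serves as the reward; the `sample_question` and `answer_score` interfaces are hypothetical placeholders, not real APIs.

```python
import torch

def reinforce_step(ivqa_model, vqa_model, images, target_answers, optimizer):
    """One policy-gradient update for belief-set construction (illustrative sketch).

    The iVQA model samples a question for (image, target_answer); the frozen
    VQA model scores the target answer for that question, and this score is
    the reward. High-reward triples enter the VQA model's belief set.
    """
    # Hypothetical interfaces: sample_question returns token ids and per-sample
    # log-probabilities; answer_score returns P(target_answer | image, question).
    questions, log_probs = ivqa_model.sample_question(images, target_answers)
    with torch.no_grad():
        reward = vqa_model.answer_score(images, questions, target_answers)   # (B,)
    baseline = reward.mean()                                  # simple baseline to reduce variance
    loss = -((reward - baseline) * log_probs).mean()          # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```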
Keywords/Search Tags:multi-label image classification, image captioning, inverse visual question answering, visual attention, reinforcement learning, belief set