
Language Understanding And Generation For Multimodal Human-Computer Interaction

Posted on: 2022-06-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q B Huang
Full Text: PDF
GTID: 1488306569459654
Subject: Software engineering

Abstract/Summary:
Multimodal human-computer interaction has become a research hotspot in the natural language processing and computer vision communities. This thesis studies visually grounded language understanding and generation. In typical human-computer interaction scenarios, an intelligent interactive system needs three basic capabilities: to answer, to ask, and to narrate. The purpose of this thesis is to investigate how to improve multimodal human-computer interaction in the "vision-and-language" setting. In pursuit of this overarching goal, we focus on the following tasks: visual question answering (VQA), question generation (QG), visual question generation (VQG), story ending generation (SEG), and image-guided story ending generation (IgSEG). To address problems in multimodal text generation, such as the lack of in-depth modeling of structural information in text and images, we propose to capture structural information, e.g., syntactic dependencies in text and relation graphs in images, using graph convolutional networks (GCNs). Our contributions can be summarized as follows:

1. VQA aims to answer a natural language question about a given image. Existing graph-based methods focus only on the relations between objects in an image and neglect the syntactic dependency relations between words in the question. To simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question, we propose a novel dual-channel graph convolutional network (DC-GCN) that better combines visual and textual strengths.

2. Previous QG models have two main shortcomings. First, they do not simultaneously capture the sequence information and structural information hidden in the context, which degrades the quality of the generated questions. Second, the generated questions often cannot be answered from the given context. To tackle these issues, we propose an entity-guided question generation model that captures both contextual structure and sequence information. We use a GCN and a bidirectional long short-term memory (BiLSTM) network to capture the structural information and the sequence information of the context, respectively. In addition, to improve the answerability of the generated questions, we use an entity-guided approach to derive the question type from the answer, and jointly encode the answer and the question type.

3. VQG aims to generate a question for a given image. Previous VQG work focused mainly on shallow image semantics, ignoring higher-level semantic information in the original image, such as relations and events. To generate more specific and semantically rich questions, the model should accurately locate the target object and grasp the relations between the questioned object and the objects around it. To enhance the quality of the generated questions, we propose a GCN-based model together with an answer and question-type encoding module.

4. SEG aims to generate a reasonable and coherent ending for a given story context. The key challenge is to comprehend the context sufficiently and capture the hidden logical information effectively, which most existing generative models have not explored well. To tackle this issue, we propose a context-aware multi-level GCN over dependency parse trees (MGCN-DP) to capture dependency relations and context clues more effectively. We use dependency parse trees to capture relations and events in the context implicitly, and multi-level GCNs to update and propagate representations across levels, obtaining richer contextual information.

5. To make the generated endings more semantically informative, imaginative, and goal-directed, we propose a new task called IgSEG. Given a multi-sentence story plot and an ending-related image, IgSEG aims to generate a story ending that conforms to the contextual logic and the relevant visual concepts. In contrast to SEG, which generates open-ended endings, the major challenges of IgSEG are to comprehend the given context and image sufficiently, and to mine appropriate semantics from the image so that the generated ending is informative, reasonable, and coherent. To address these challenges, we propose a model based on multi-layer graph convolution and a cascade LSTM (MGCL).

To sum up, this thesis focuses on common tasks in multimodal human-computer interaction: asking, answering, and narrating. The proposed models exploit the strong aggregation ability of graph convolutional networks, aggregating over syntactic dependency trees of the text and object-level relation graphs (e.g., scene graphs) of the image. By deeply mining intra- and inter-sentence dependency relations in the text and the relations between objects in the image, the models understand textual and visual inputs more comprehensively and align and merge modalities more precisely, thereby enhancing generation quality and improving the effect of multimodal human-computer interaction. The research shows that structural information in text, such as intra- and inter-sentence dependency relations, is conducive to an in-depth understanding of the text and to fine-grained text generation. It also shows that injecting symbolic representations into neural networks improves the quality of text generation. Graph convolution over syntactic dependency trees and object-level relation graphs (e.g., scene graphs) promotes alignment and fusion between the language and vision modalities, and provides a path for research on language understanding and generation in multimodal human-computer interaction.
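The core aggregation step shared by the models above can be sketched as a single graph-convolution layer that propagates word representations along a syntactic dependency tree. The sketch below is illustrative only: the toy dependency edges, dimensions, and function name are assumptions, not taken from the thesis, and the update rule is the standard mean-aggregation GCN form rather than any specific model proposed here.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: mean-aggregate neighbor features over the
    self-loop-augmented graph, then apply a linear projection and ReLU."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # node degrees for mean aggregation
    H_agg = (A_hat @ H) / deg               # average each node with its neighbors
    return np.maximum(H_agg @ W, 0.0)       # projection + ReLU

# Toy example: 4 words with undirected dependency edges (0-1), (1-2), (1-3).
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1.0

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # initial word embeddings (dim 8)
W = rng.normal(size=(8, 8))   # layer weights

H1 = gcn_layer(H, A, W)
print(H1.shape)  # (4, 8)
```

Stacking several such layers, as in the multi-level GCNs described above, lets information flow across increasingly distant dependency neighborhoods.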
Keywords/Search Tags:multimodal human-computer interaction, language understanding and generation, visual question answering, question generation, story ending generation, graph convolutional network