| With the rapid growth of multimedia data and the steady development of deep learning techniques, multimodal dialog systems have attracted wide attention from academia and industry. The introduction of ChatGPT, a conversational system, in late 2022 brought the task to the forefront of research. The goal of multimodal dialog systems is to engage in natural and authentic dialogs with users across multiple media formats (e.g., images, videos, and text). Compared to text-only dialog systems, multimodal dialog systems allow the user to supplement the expression of intent with additional information such as images. This not only greatly improves the user experience but also helps the system better understand user needs. This paper focuses on multimodal dialog systems for online shopping scenarios, which involve both text and image data. Faced with diverse and complex dialogs, the key for multimodal dialog systems in this domain is to comprehensively understand users’ questions and generate accurate system responses. Although existing approaches have achieved promising performance, the following challenges in question understanding (i.e., user intention understanding) remain: (1) Textual sentence modeling. Existing work usually builds feature vectors for whole sentences without considering the relative importance of different words, and thus ignores product keywords whose rich semantic cues are important for understanding user intentions. Therefore, constructing a textual encoder that adaptively focuses on intention-related information in textual sentences is one challenge of this task. (2) Relational context modeling. Existing approaches usually use hierarchical recurrent neural networks to encode utterance sequences, overlooking the complementary relationships between different utterances and their unequal contributions to user intentions. Therefore, accounting for the relational context of each utterance and adaptively reweighting the utterances’ contributions for precise user intention modeling is another challenge of this task. To address these challenges, this paper proposes a relational graph-based context-aware question understanding framework, which progressively improves user intention modeling at both the local and global levels. The framework consists of three parts: a node initialization module, a relational context modeling module, and a response generator. First, to enhance understanding of the user’s local intention, the node initialization module introduces a novel multiple attribute matrix (covering attributes such as color and material) as a guide, and highlights the product-related keyword information embedded in the utterances by performing hierarchical attention computation with each utterance. Then, to further enhance understanding of the user’s global intention over the whole context, the relational context modeling module discards the commonly used hierarchical recurrent encoder and constructs a sparse graph attention network by extending the original graph attention network. Using the designed sparse adjacency matrix update strategy, the network dynamically adjusts the connections between utterances, sparsifies the dense connections among them, and mines the effective relational context of each utterance, thereby accounting for the complementary relationships between different utterances in the context. Finally, based on the resulting user intention representation, the response generator can accurately generate text and image responses that meet user requirements. Extensive experiments on a benchmark dataset demonstrate the effectiveness of the proposed framework in user intention modeling. |
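
The idea of sparsifying the connections between utterance nodes before attention can be illustrated with a minimal sketch. The abstract does not specify the sparse adjacency matrix update strategy, so everything below is an assumption made for illustration: the top-k neighbour rule, the scaled dot-product scores standing in for a learned attention function, and the names `topk_mask`, `softmax_masked`, and `sparse_attention_layer` are hypothetical and are not the paper's method.

```python
import math


def softmax_masked(scores, mask):
    """Softmax over the entries where mask is True; masked entries get weight 0."""
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def topk_mask(scores, k, self_index):
    """Keep the k highest-scoring nodes plus the node itself (assumed rule)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = set(order[:k]) | {self_index}
    return [j in keep for j in range(len(scores))]


def sparse_attention_layer(nodes, k):
    """One attention pass over utterance nodes with a top-k sparsified adjacency."""
    n, d = len(nodes), len(nodes[0])
    updated, adjacency = [], []
    for i in range(n):
        # Scaled dot-product scores stand in for a learned attention function.
        scores = [
            sum(a * b for a, b in zip(nodes[i], nodes[j])) / math.sqrt(d)
            for j in range(n)
        ]
        mask = topk_mask(scores, k, i)          # sparsify row i of the adjacency
        weights = softmax_masked(scores, mask)  # reweight the kept utterances
        updated.append(
            [sum(weights[j] * nodes[j][c] for j in range(n)) for c in range(d)]
        )
        adjacency.append(mask)
    return updated, adjacency


# Four toy utterance vectors: two about one topic, two about another.
utterances = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
updated, adjacency = sparse_attention_layer(utterances, k=2)
```

In this toy setting each utterance keeps only the connections to itself and its most similar neighbour, so the dense 4×4 graph collapses into two small clusters. A learned update strategy would presumably replace the fixed top-k rule.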