
Research On Semantic Understanding And Generation Method Of Multimodal Task-Oriented Dialog Based On Pre-training And Fine-tuning

Posted on: 2024-07-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Ma
Full Text: PDF
GTID: 1528307319964129
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the popularity of multimedia devices, task-oriented dialogue systems for multimodal scenarios have increasingly broad application prospects. As an effective parameter-transfer approach, the pre-training-fine-tuning paradigm first performs multimodal self-supervised representation learning on large-scale image-text pairs and then provides embedding support for downstream multimodal tasks. However, in complex and challenging multimodal task-oriented dialogue, existing pre-trained models cannot be directly transferred to downstream dialogue tasks because of poor modal alignment, difficult visual understanding, weak knowledge reasoning, and low response quality. To cope with these issues, a progressive pre-training-fine-tuning method is proposed from a longitudinal perspective: modal-alignment pre-training is first conducted on large-scale image-text pairs to enhance the cross-modal semantic embedding ability of the pre-trained model; the model's visual comprehension and question-answering abilities and its knowledge retrieval and reasoning abilities are then further strengthened by fine-tuning on two subtasks, visual question answering (VQA) and knowledge-oriented dialogue (KOD); finally, combined fine-tuning is carried out on a multimodal e-commerce dialogue dataset to improve the semantic understanding and response generation abilities of the task-oriented dialogue system. Specifically:

Firstly, in terms of visual-language pre-training based on modal alignment, existing pre-training methods based on mask reconstruction and contrastive learning struggle to model explicit modal alignment and, in particular, tend to ignore the cross-modal semantic alignment of key entities, which harms semantic understanding and representation transfer. To solve the problem of poor fine-grained modal alignment in pre-trained models, a cross-modal visual-language pre-training method based on associative learning is proposed. Through cross-modal feature prompts and contextual attention, it models the fine-grained semantic mapping between modalities in an implicit associative mapping space, thereby enhancing the cross-modal alignment performance of visual-language pre-trained models. Experimental results on four downstream multimodal tasks, namely visual question answering, visual reasoning, visual entailment, and referring expression comprehension, show that the proposed associative learning method has fine-grained semantic understanding and modal alignment capabilities and can provide effective cross-modal representation embedding support for question answering, reasoning, discrimination, and understanding in downstream multimodal dialogue.

Secondly, in terms of question-answering fine-tuning oriented to visual comprehension, a hybrid prompt-tuning strategy with human priors is proposed to address the task incompatibility and language bias of existing visual-language pre-trained models on downstream multimodal dialogue. By constructing cloze-style templates similar to the upstream pre-training tasks, prompt-tuning effectively alleviates the incompatibility between upstream and downstream tasks, and trainable human prompts guide the model toward effective visual understanding and question answering. The experimental results show that the hybrid prompt-tuning strategy integrating human priors improves the visual understanding and question-answering ability of pre-trained models.
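As a concrete illustration of the associative-learning idea, the sketch below pairs learnable cross-modal feature prompts with contextual attention and trains them with a symmetric contrastive loss. The module names, dimensions, and the exact loss are illustrative assumptions, not the dissertation's actual architecture.

```python
# Minimal PyTorch sketch: cross-modal feature prompts + contextual attention
# for fine-grained alignment. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociativeAligner(nn.Module):
    def __init__(self, dim=256, num_prompts=8, num_heads=4):
        super().__init__()
        # Learnable cross-modal feature prompts shared by both modalities.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Contextual attention: prompts attend to each modality's features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def summarize(self, feats):
        # feats: (batch, seq_len, dim) token or region features.
        b = feats.size(0)
        q = self.prompts.unsqueeze(0).expand(b, -1, -1)
        ctx, _ = self.attn(q, feats, feats)                  # prompt-guided context
        return F.normalize(self.proj(ctx).mean(1), dim=-1)   # (batch, dim)

    def forward(self, text_feats, image_feats, temperature=0.07):
        t = self.summarize(text_feats)
        v = self.summarize(image_feats)
        # Symmetric InfoNCE over the batch: matched pairs lie on the diagonal.
        logits = t @ v.t() / temperature
        labels = torch.arange(t.size(0), device=t.device)
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

# Toy usage with random features standing in for encoder outputs.
aligner = AssociativeAligner()
text = torch.randn(4, 16, 256)    # 4 captions, 16 tokens each
image = torch.randn(4, 36, 256)   # 4 images, 36 region features each
loss = aligner(text, image)
loss.backward()
```

Routing both modalities through the same prompt set is one simple way to realize a shared "associative mapping space": the prompts act as common anchors that each modality's features are summarized against before alignment.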
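The hybrid prompt-tuning strategy can likewise be sketched in a few lines: a hand-written cloze template supplies the human prior, while trainable soft prompt vectors are prepended to the frozen backbone's input embeddings. The template wording, prompt length, and toy vocabulary below are assumptions for illustration only.

```python
# Minimal sketch: cloze-style template (human prior) + trainable soft prompts
# in front of frozen embeddings. Details are illustrative assumptions.
import torch
import torch.nn as nn

def build_cloze(question: str) -> str:
    # Reformulate the VQA instance as an upstream-style cloze query.
    return f"Question: {question} Answer: [MASK]."

class SoftPromptWrapper(nn.Module):
    def __init__(self, embed: nn.Embedding, prompt_len=5):
        super().__init__()
        self.embed = embed
        for p in self.embed.parameters():
            p.requires_grad = False           # backbone embeddings stay frozen
        self.soft = nn.Parameter(
            torch.randn(prompt_len, embed.embedding_dim) * 0.02)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        e = self.embed(token_ids)              # frozen token embeddings
        b = e.size(0)
        soft = self.soft.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([soft, e], dim=1)     # trainable prefix + cloze text

# Toy usage: a 100-word vocabulary stands in for a real tokenizer/backbone.
vocab_embed = nn.Embedding(100, 32)
wrapper = SoftPromptWrapper(vocab_embed)
ids = torch.randint(0, 100, (2, 12))           # fake token ids for two cloze strings
inputs = wrapper(ids)                          # (2, 5 + 12, 32) fed to the encoder
print(build_cloze("what color is the cat?"), inputs.shape)
```

Because only the soft prefix is trainable, the downstream task is recast in the masked-prediction form the backbone saw during pre-training, which is what alleviates the upstream-downstream incompatibility.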
Then, in terms of task-oriented dialogue fine-tuning for knowledge reasoning, a task-oriented dialogue fine-tuning method based on an intention reasoning network is proposed to address the weak knowledge retrieval and reasoning abilities and the unexplainable reasoning processes of visual-language pre-trained models and current task-oriented dialogue methods. By using a memory network mechanism for coarse-grained knowledge retrieval and an intention reasoning module for fine-grained knowledge inference, the method greatly enhances the dialogue model's fine-grained knowledge retrieval and reasoning ability. Moreover, by designing a novel intention mechanism and a hierarchical response mechanism, it improves the robustness and reliability of the model's responses and thus further improves user satisfaction. The experimental results show that the delta-tuning strategy with external knowledge inference improves the knowledge retrieval, inference, and task-oriented response abilities of the dialogue model.

Finally, in terms of high-quality multimodal response generation, a multimodal task-oriented dialogue method based on a unified representation framework is proposed to address the disunity of vision, text, and knowledge embeddings and the poor response quality of current decoupling-based task-oriented dialogue models. A unified multimodal dialogue embedder embeds information from different modalities into a single semantic space; the pre-trained embedder is then fine-tuned with a coarse-grained image-text matching task and a fine-grained word-region alignment task to obtain a better representation of user intent, which serves as an intention-aware query vector. Next, a fine-grained knowledge query and reasoning module based on key-value memory enables effective entity-level knowledge memorization and textual response generation. Finally, combined fine-tuning trains a task-oriented dialogue model that supports both semantic understanding and multimodal response generation. The experimental results show that the proposed method effectively improves the response quality of multimodal task-oriented dialogue and generalizes to a variety of tasks, including cross-modal semantic understanding and multimodal knowledge reasoning.
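A minimal sketch of the coarse-to-fine knowledge access described above: a key-value memory is first addressed coarsely over all slots, then re-weighted over the top-k candidates, with the query refined over multiple hops. The top-k filtering and hop count are illustrative assumptions, not the dissertation's exact design.

```python
# Minimal sketch: coarse retrieval + fine-grained weighting over a key-value
# knowledge memory, with multi-hop query refinement. Details are assumptions.
import torch
import torch.nn.functional as F

def kv_memory_read(query, keys, values, top_k=4, hops=2):
    """query: (dim,)  keys/values: (num_slots, dim)"""
    for _ in range(hops):
        scores = keys @ query                     # coarse match against all slots
        top = torch.topk(scores, k=top_k)         # keep the top-k candidates
        attn = F.softmax(top.values, dim=-1)      # fine-grained weighting
        read = (attn.unsqueeze(-1) * values[top.indices]).sum(0)
        query = query + read                      # refine the query each hop
    return query, top.indices

# Toy usage: 10 knowledge entries encoded as key/value vectors.
torch.manual_seed(0)
keys, values = torch.randn(10, 64), torch.randn(10, 64)
intent = torch.randn(64)                          # intention-aware query vector
read_out, supporting = kv_memory_read(intent, keys, values)
print(read_out.shape, supporting.tolist())        # refined query + evidence slots
```

Returning the supporting slot indices alongside the read vector is one way to make the reasoning process inspectable, which speaks to the explainability concern raised above.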
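The two embedder fine-tuning objectives named above can also be sketched: a coarse-grained image-text matching (ITM) head and a fine-grained word-region alignment (WRA) loss in which each word attends to its best-matching region. This per-word max-over-regions formulation is a common approximation; the dissertation's exact objective may differ.

```python
# Minimal sketch: ITM (coarse) + word-region alignment (fine) objectives
# for fine-tuning a unified multimodal embedder. Details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedObjectives(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.itm_head = nn.Linear(2 * dim, 2)       # matched vs. mismatched

    def itm_loss(self, text_cls, image_cls, matched):
        # text_cls/image_cls: (batch, dim); matched: (batch,) 0/1 labels.
        logits = self.itm_head(torch.cat([text_cls, image_cls], dim=-1))
        return F.cross_entropy(logits, matched)

    def wra_loss(self, words, regions, temperature=0.07):
        # words: (batch, n_words, dim); regions: (batch, n_regions, dim).
        w = F.normalize(words, dim=-1)
        r = F.normalize(regions, dim=-1)
        sim = torch.einsum('bwd,BRd->bBwR', w, r)   # all text-image pairings
        # Each word takes its best region, then scores average over words.
        pair_score = sim.max(dim=-1).values.mean(dim=-1) / temperature
        labels = torch.arange(w.size(0), device=w.device)
        return F.cross_entropy(pair_score, labels)  # aligned pair should win

# Toy usage with random encoder outputs.
obj = UnifiedObjectives()
loss = (obj.itm_loss(torch.randn(4, 256), torch.randn(4, 256),
                     torch.tensor([1, 0, 1, 1])) +
        obj.wra_loss(torch.randn(4, 12, 256), torch.randn(4, 36, 256)))
loss.backward()
```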
Keywords/Search Tags:Multimodal Task-oriented Dialogue, Modal Alignment, Knowledge Reasoning, Pre-training and Fine-tuning, Image-Text Response