
Research on Multimodal Representation Learning Based on a Multi-level Attention Mechanism

Posted on: 2023-03-23  Degree: Master  Type: Thesis
Country: China  Candidate: M G He  Full Text: PDF
GTID: 2568306830452874  Subject: Software engineering
Abstract/Summary:
Multimodal representation learning has become a hot research direction in computer vision, natural language processing, and speech processing. However, most existing multimodal representation learning methods still face challenges in multimodal hierarchical feature extraction and multi-level feature fusion that urgently need to be solved. (1) The information asymmetry problem: existing methods focus on extracting fine-grained local information from one modality, and when no corresponding part exists in the other modalities, the supervision information loses its effect. (2) The mixed-level fusion problem: existing methods mostly fuse multimodal data across levels with a single unified representation, destroying the hierarchical structure of the data. These challenges make it difficult for most existing methods to solve multimodal representation learning tasks effectively.

To address the information asymmetry problem, this paper works in two directions: a multi-level feature extraction model and an inter-level loss function. First, a Transformer extracts contextual features and local features of the video and the text from different perspectives, and an attention mechanism fuses the multiple local features of each modality into global features. The fused global features and the contextual features are then combined into modal features. This makes full use of global features from different perspectives, increases the supervision information, and effectively avoids the information asymmetry caused by fine-grained local information. Second, to strengthen inter-level constraints, this paper proposes an inter-level consistency loss function to further aid representation learning. Extensive experiments on video-text retrieval and video captioning tasks, together with visualization results, show that the features extracted by the proposed method contain richer information.

To address the mixed-level fusion problem, this paper again works in two directions: a multi-level fusion method and cross-modal knowledge distillation. It first proposes a multimodal collaborative representation model with multi-level feature extraction and hierarchical feature fusion: features at different levels are extracted from the data, and learnable weights assign the features of each modality to their proper levels so that fusion occurs only between features at the same level, which effectively solves the mixed-level fusion problem. Second, because multimodal data differ in quality and granularity, this paper proposes a cross-modal knowledge distillation method to extract useful information across modalities. Extensive experiments on video-text retrieval and video captioning tasks, together with visualization results, show that the proposed hierarchical fusion method effectively fuses multimodal features to aid representation learning.
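The attention-based fusion of local and contextual features described above can be made concrete with a short sketch. The following PyTorch module is a minimal illustration under assumptions, not the thesis's actual architecture: the single-layer scoring head, the dimensions, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Pool several local features into a global feature with learned
    attention weights, then merge it with a context feature (a sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # scores each local feature (assumed head)
        self.merge = nn.Linear(2 * dim, dim)  # fuses global + context (assumed head)

    def forward(self, local_feats: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # local_feats: (batch, num_locals, dim); context: (batch, dim)
        weights = F.softmax(self.score(local_feats), dim=1)  # attention over locals
        global_feat = (weights * local_feats).sum(dim=1)     # (batch, dim)
        return self.merge(torch.cat([global_feat, context], dim=-1))

# Usage with toy shapes: 16 local video features of width 512.
fusion = AttentionFusion(dim=512)
modal_feat = fusion(torch.randn(8, 16, 512), torch.randn(8, 512))  # (8, 512)
```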
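The abstract names an inter-level consistency loss but does not define it. One plausible instantiation, offered purely as an assumption, penalizes disagreement between the video-text cosine similarity measured at adjacent feature levels:

```python
import torch
import torch.nn.functional as F

def inter_level_consistency(video_levels, text_levels):
    """Hypothetical inter-level consistency loss: cross-modal similarity
    should agree across feature levels. Each list holds one (batch, dim)
    tensor per level; equal dimensions across levels are assumed."""
    sims = [F.cosine_similarity(v, t, dim=-1)  # (batch,) similarity per level
            for v, t in zip(video_levels, text_levels)]
    loss = torch.zeros(())
    for lo, hi in zip(sims, sims[1:]):
        loss = loss + (lo - hi).pow(2).mean()  # adjacent levels should match
    return loss
```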
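The hierarchical fusion step relies on learnable weights that assign features to levels. A soft routing layer is one way to realize this; the linear assignment head and einsum pooling below are an assumed design, not the thesis's stated one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelRouter(nn.Module):
    """Softly assign each of a modality's features to one of `num_levels`
    fusion levels via learnable weights (a sketch of the idea)."""

    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.assign = nn.Linear(dim, num_levels)  # per-feature level logits

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_feats, dim) -> one pooled feature per level
        probs = F.softmax(self.assign(feats), dim=-1)      # (batch, num_feats, num_levels)
        return torch.einsum('bfl,bfd->bld', probs, feats)  # (batch, num_levels, dim)
```

Same-level outputs from the video and text routers can then be fused pairwise, keeping fusion within levels rather than mixing across them.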
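Cross-modal knowledge distillation is likewise not specified in the abstract. A standard temperature-scaled KL formulation (Hinton-style distillation, assumed here in place of the thesis's method) would let a lower-quality modality act as student and mimic a higher-quality teacher modality:

```python
import torch.nn.functional as F

def cross_modal_distill(student_logits, teacher_logits, tau: float = 4.0):
    """Standard knowledge-distillation loss applied across modalities:
    KL divergence between temperature-softened teacher and student outputs."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction='batchmean') * tau * tau
```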
Keywords/Search Tags:Multimodality, Representation Learning, Multi-level, Attention Mechanism