
Multimodal Cognitive Learning For Audio-visual Data

Posted on: 2022-02-14    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H L Ning    Full Text: PDF
GTID: 1488306734479264    Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of technologies such as the mobile Internet and cloud storage, audio-visual data carrying rich information have been growing at an exponential rate. These massive data not only make manual analysis impractical but also provide a data foundation for research on audio-visual multi-modal cognitive learning. Multi-modal cognitive learning based on audio-visual data is intended to simulate the way humans perceive and reason. By exploring audio-visual data, it aims to build machine cognitive models covering audio-visual cross-modal information conversion, audio-visual cross-modal semantic alignment, and audio-visual cognitive understanding. Such models can serve practical needs in fields such as national defense, public safety, autonomous driving, and assistance for the blind, so multi-modal cognitive learning based on audio-visual data is of significant research value. However, audio-visual data have complex structures, and a serious heterogeneous semantic gap exists between the modalities, which poses severe challenges to effective audio-visual cognitive learning. The key problems include: 1) semantic information retention in audio-visual cross-modal information conversion; 2) feature alignability in audio-visual cross-modal matching; 3) multi-relation modeling in audio-visual cross-modal retrieval; and 4) collaborative relationship modeling in audio-visual cognitive understanding. Focusing on these problems, this thesis carries out the corresponding research and obtains several results. The main contents and innovations are as follows:

1) To address the problem of semantic information retention in audio-visual cross-modal information conversion, this thesis proposes an audio description generation algorithm based on semantic retention. To preserve as much semantic information as possible during audio-visual cross-modal conversion, the thesis explores a semantic similarity subspace and performs the conversion at the level of low-dimensional features. Furthermore, in view of the difficulty of generating understandable natural audio descriptions, the thesis introduces a one-dimensional gated dilated convolution to model the complex internal relationships within the cross-modal feature representations, thereby generating high-quality audio descriptions. Extensive experiments show that the proposed algorithm can generate understandable audio descriptions from visual images and achieves superior performance.
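As an illustration of the gated dilated convolution mentioned above, the following is a minimal sketch in PyTorch; the layer names, channel sizes, and the residual connection are assumptions made for exposition, not the thesis implementation.

import torch
import torch.nn as nn

class GatedDilatedConv1d(nn.Module):
    """WaveNet-style gated activation over a dilated 1-D convolution, used to
    model long-range structure in a sequence of cross-modal features."""

    def __init__(self, channels, kernel_size=3, dilation=2):
        super().__init__()
        padding = (kernel_size - 1) * dilation // 2  # keep the sequence length
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size,
                                     padding=padding, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size,
                                   padding=padding, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time); the sigmoid gate controls how much of each
        # filter response passes through, and the residual path keeps the input.
        h = torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))
        return x + h

# Example: a batch of 128-dimensional cross-modal features over 50 time steps.
features = torch.randn(4, 128, 50)
out = GatedDilatedConv1d(128)(features)  # same shape as the input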
2) To address the feature alignability problem in audio-visual cross-modal matching, this thesis considers the alignability of features from different modalities for the first time and proposes an audio-visual cross-modal matching algorithm based on disentangled representation learning. By partitioning the set of explanatory factors in a latent space, the features shared between the two modalities are disentangled and semantically aligned, which improves the performance of the audio-visual cross-modal matching model (a minimal sketch of this shared/modality-specific split follows the abstract). Extensive experiments on three downstream sub-tasks show that the proposed algorithm can effectively separate the alignable shared features from the non-alignable modality-dependent features, thereby achieving accurate semantic alignment across modalities.

3) To address the multi-relation modeling problem in audio-visual cross-modal retrieval, this thesis proposes an audio-visual cross-modal retrieval algorithm based on semantic consistency representation learning. By exploring a semantically consistent embedding space, the paired, intra-modal, and non-paired inter-modal relationships are modeled simultaneously to narrow the heterogeneous semantic gap between the modalities (a sketch of such a consistency objective is also given after the abstract). Meanwhile, to capture the long-range correlations in audio signals and learn effective audio semantic features, an audio encoding network that introduces one-dimensional dilated convolution is proposed. Extensive experimental results show that the proposed algorithm effectively improves the semantic consistency between cross-modal feature representations, raising the mAP retrieval metric by nearly 9%.

4) To address the collaborative relationship modeling problem in audio-visual cognitive understanding, this thesis proposes a visual attention prediction algorithm based on bio-inspired audio-visual cue integration. By designing an effective audio-visual localization strategy, the consistency relationship between audio and visual information is learned and the interference caused by inconsistent audio-visual cues is reduced, enabling accurate sound source localization. Furthermore, a multi-cue fusion strategy is introduced to adaptively integrate multi-cue information and generate the final visual attention map (the last sketch after the abstract illustrates such an adaptive fusion). Experimental results show that the proposed algorithm effectively improves the performance of the visual attention prediction model through the adaptive integration of audio-visual cues.
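To make contribution 2) concrete, the following is a minimal sketch of splitting each modality's latent code into an alignable shared part and a modality-specific part; the encoder structure, the dimensions, and the cosine alignment loss are illustrative assumptions rather than the thesis formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledEncoder(nn.Module):
    def __init__(self, in_dim, shared_dim=64, private_dim=64):
        super().__init__()
        self.shared_head = nn.Linear(in_dim, shared_dim)    # alignable factors
        self.private_head = nn.Linear(in_dim, private_dim)  # modality-dependent factors

    def forward(self, x):
        return self.shared_head(x), self.private_head(x)

audio_enc, visual_enc = DisentangledEncoder(512), DisentangledEncoder(512)
a_shared, a_private = audio_enc(torch.randn(8, 512))   # audio features
v_shared, v_private = visual_enc(torch.randn(8, 512))  # visual features

# Align only the shared factors of paired audio-visual samples; the private
# factors are left free to absorb modality-specific variation.
align_loss = 1.0 - F.cosine_similarity(a_shared, v_shared, dim=-1).mean()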
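For contribution 3), the sketch below scores the three relation types named above, paired cross-modal, intra-modal, and non-paired inter-modal, in one embedding space; the margin-based form and all parameter values are assumptions, not the published objective.

import torch
import torch.nn.functional as F

def semantic_consistency_loss(audio, visual, labels, margin=0.2):
    # audio, visual: (batch, dim) embeddings of paired clips; labels: (batch,)
    audio = F.normalize(audio, dim=-1)
    visual = F.normalize(visual, dim=-1)

    sim = audio @ visual.t()                    # inter-modal similarities
    paired = sim.diag()                         # matched audio-visual pairs
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)

    # Paired relation: a matched pair should score above every non-paired,
    # different-class inter-modal pair by at least the margin.
    inter = F.relu(margin + sim - paired.unsqueeze(1))[~same_class].mean()
    # Intra-modal relation: same-class items within a modality stay close.
    intra = (1.0 - (audio @ audio.t())[same_class]).mean() \
          + (1.0 - (visual @ visual.t())[same_class]).mean()
    return inter + intra

loss = semantic_consistency_loss(torch.randn(16, 128), torch.randn(16, 128),
                                 torch.randint(0, 4, (16,)))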
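For contribution 4), the following sketch fuses a visual saliency cue with an audio-driven localization cue through adaptively predicted weights; the weighting network is an illustrative assumption, not the thesis architecture.

import torch
import torch.nn as nn

class AdaptiveCueFusion(nn.Module):
    def __init__(self, num_cues=2):
        super().__init__()
        # One scalar weight per cue, predicted from each cue's global statistics.
        self.weighting = nn.Linear(num_cues, num_cues)

    def forward(self, cues):
        # cues: (batch, num_cues, H, W) stack of candidate attention maps.
        stats = cues.mean(dim=(2, 3))                      # (batch, num_cues)
        weights = torch.softmax(self.weighting(stats), 1)  # adaptive cue weights
        fused = (weights[:, :, None, None] * cues).sum(dim=1)
        return fused                                       # (batch, H, W)

visual_cue = torch.rand(2, 1, 64, 64)   # e.g. a visual saliency map
audio_cue = torch.rand(2, 1, 64, 64)    # e.g. a sound-source localization map
attention = AdaptiveCueFusion()(torch.cat([visual_cue, audio_cue], dim=1))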
Keywords/Search Tags: Multi-Modal Learning, Audio-Visual Cross-Modal Information Conversion, Audio-Visual Cross-Modal Semantic Alignment, Audio-Visual Cognitive Understanding, Heterogeneous Semantic Gap