Research And Application On Techniques Of Cross-Modal Correlation Learning

Posted on: 2023-10-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: A Li
Full Text: PDF
GTID: 1528307136999239
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of multimedia communications and sensing technology, traditional audio-visual services will not be able to meet users' demands for immersive and diverse experiences. Multi-modal services that involve multi-sensory experience have become a new direction of China’s information construction. These new scenarios will inevitably bring massive multi-source heterogeneous data such as video, audio, text, and haptics. There is therefore an urgent need for lightweight, efficient, and flexible multi-modal artificial intelligence technology. However, because multi-modal alignment and fusion operations are complicated, existing multi-modal deep learning models suffer from several technical issues: high computational complexity, error accumulation, difficulty in balancing training, and poor robustness. This dissertation therefore targets two fields, multi-modal signal processing and multi-modal communication, and focuses on exploiting the cross-modal correlation information between different modalities. For different application requirements, it designs corresponding cross-modal correlation learning methods that support applications ranging from image recognition to intelligent communication. Specifically, the contributions of this dissertation can be summarized in the following four aspects:

In "text-image" fine-grained image recognition, a cross-modal fine-grained recognition scheme is proposed to address ultra-fine-grained samples and partial occlusion. On the one hand, the scheme uses the prior relationships between text labels to construct a non-random image sampler and designs a contrastive-learning-based loss function. These two modules jointly help convolutional networks distinguish ultra-fine-grained targets. On the other hand, a global-information-assisted network is proposed that uses global contextual semantics to filter out local semantics affected by partial occlusion. In this way, the scheme achieves significant improvements in both accuracy and robustness for fine-grained image recognition.

In "infrared-RGB" object detection, a cross-modal object detection scheme is proposed to address lightweight deployment. Through a selective feature distillation strategy and an adaptive prediction distillation strategy, the scheme transfers feature knowledge and prediction knowledge, respectively, from a multi-modal object detection network to its uni-modal counterpart in a supervised learning manner. Only the well-trained uni-modal object detection network is then deployed on platforms with limited computational resources. As a result, the accuracy of object detection on these platforms is substantially improved without increasing the parameters or computation of the deployed network, achieving a trade-off between accuracy and computational complexity.

In "video-haptics" signal reconstruction, a cross-modal signal reconstruction scheme is proposed to address the poor haptic experience at the communication receiver. First, a large-scale dataset named VisTouch, which includes audio, video, and haptic signals, is constructed. This dataset can lay the foundation for a variety of cross-modal research. Second, to reconstruct precise multi-modal signals, a systematic cross-modal signal reconstruction architecture is proposed, comprising a feature extraction module, a reconstruction module, and an evaluation module. Reconstruction from video to haptic signals is then taken as an example: a video-assisted haptic reconstruction model is established, consisting of a video extraction network, a haptic signal generation network, and a haptic signal discrimination network. Finally, experimental results demonstrate that the proposed method can precisely reconstruct semantically consistent haptic signals from video signals, effectively enhancing the haptic experience at the receiver.

In "audio-video-haptics" semantic communications, a cross-modal semantic communication architecture is proposed to address the polysemy of semantic coding and the ambiguity of semantic decoding. First, a cross-modal semantic encoder based on explicit-implicit parsing is developed to exploit the potential semantic correlation information in the multi-modal signals and use it as a medium for inferring implicit semantics. In this way, the polysemy is effectively reduced. Then, a reinforcement-learning-based cross-modal semantic decoder is developed to achieve high-quality signal reconstruction through the dual optimization of semantic similarity and bit similarity. In this way, the ambiguity is effectively reduced. Next, a cross-modal knowledge graph is designed to provide rich background knowledge and signal patches for cross-modal semantic encoding and decoding. The experimental results demonstrate that the proposed method achieves efficient semantic encoding and decoding at a very low compression rate, improving the efficiency of the semantic communication system and satisfying users' high reliability requirements for multi-modal services.
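As an illustration of the teacher-student transfer described for the "infrared-RGB" detection scheme, the following is a minimal sketch of a generic distillation objective that combines a feature term and a prediction term. It is not the dissertation's method: the abstract does not specify how the selective and adaptive strategies are computed, so this sketch uses plain mean squared error on features and a temperature-softened KL divergence on class logits. All function names, the weights `alpha` and `beta`, and the `temperature` parameter are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def feature_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def prediction_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log of zero after underflow
    kl = sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


def distillation_loss(student_feat, teacher_feat, student_logits, teacher_logits,
                      alpha=0.5, beta=0.5, temperature=2.0):
    """Weighted sum of the feature term and the prediction term (assumed weights)."""
    feat = feature_distillation_loss(student_feat, teacher_feat)
    pred = prediction_distillation_loss(student_logits, teacher_logits, temperature)
    return alpha * feat + beta * pred


if __name__ == "__main__":
    # Teacher: multi-modal (infrared + RGB) network; student: uni-modal network.
    loss = distillation_loss(
        student_feat=[0.2, 0.4, 0.1],
        teacher_feat=[0.3, 0.5, 0.0],
        student_logits=[1.0, 0.5, -0.5],
        teacher_logits=[2.0, 0.0, -1.0],
    )
    print(f"combined distillation loss: {loss:.4f}")
```

In a training loop, a term like this would typically be added to the student's ordinary supervised detection loss, and only the student would be kept for deployment, which is consistent with the deployment setting the abstract describes.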
Keywords/Search Tags: Multi-modal Services, Fine-grained Image Recognition, Object Detection, Signal Reconstruction, Semantic Communications