With the continuous development of information technology, multi-modal data have recently become the main form of data resources. In this paper, we focus on the cross-modal matching problem for visual and linguistic data, i.e., establishing connections between visual and linguistic signals so that the same pattern is represented consistently across the two modalities. This problem plays a crucial role in many important tasks, such as visual question answering and cross-modal retrieval. Current work on cross-modal matching mainly focuses on cross-modal joint representation and cross-modal interaction for performance optimization. Such work ignores the structural information embedded in visual signals and the interplay between cross-modal matching and structural parsing, so the resulting models cannot adequately express the structural information of complex pictures (diagrams, charts, etc.) in cross-modal matching tasks. In addition, most current cross-modal matching studies are restricted to closed environments in which the training and test sets are independent and identically distributed. In such environments, models cannot cope with domain shifts and thus lack the prerequisites for large-scale application. Finally, given the continuous iteration of cross-modal matching in the open world, the ability to learn potentially unknown concepts is necessary, but it has not been fully investigated in current work.

For the cross-modal matching problem in a closed environment, this paper proposes a Hierarchical Multi-Task Learning (HMTL) model for diagram question answering based on a multi-modal Transformer framework. The proposed approach uses different Transformer layers to form a hierarchical structure for multi-task learning, making diagram parsing and cross-modal matching mutually reinforcing. The proposed model achieves state-of-the-art performance on the AI2D and FOODWEBS datasets, demonstrating its superiority and effectiveness.

For the domain adaptation problem under domain shift, this paper proposes a rank-aware cross-modal adversarial domain adaptation network for the cross-modal retrieval task. Feature encoders and a domain discriminator are trained adversarially to reduce domain differences. To better introduce task constraints, a ranking predictor is designed to predict the ranking scores of unimodal target-domain samples; the predicted scores are fed into the domain discriminator to aid domain alignment. In addition, to better exploit the knowledge obtained from the source domain, a source concept classifier is designed for conceptual alignment between the source and target domains. Domain adaptation experiments are conducted on the image-text retrieval datasets MSCOCO and Flickr30K.

To address both cross-modal domain adaptation and unknown concept learning in an open environment, this paper proposes a new paradigm called "active cross-modal domain adaptation" to help the model adapt to unknown concepts. Specifically, an active cross-modal domain adaptation framework is designed, consisting of a cross-modal domain adaptation module and an active learning module. The cross-modal domain adaptation module performs domain alignment with existing supervised information; the active learning module then identifies and annotates target samples containing unknown concepts using a rank-aware active learning strategy. The two modules are trained simultaneously and promote each other. Experimental results on two active cross-modal domain adaptation benchmarks demonstrate significant improvements of the proposed method under domain shifts and unknown concepts.
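The hierarchical multi-task idea, i.e., attaching the diagram-parsing task to an intermediate layer and the question-answering task to the top layer of a shared stack, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the layer function, dimensions, head sizes, and labels are all hypothetical stand-ins for a real Transformer stack.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    """Toy stand-in for one Transformer layer: linear map + ReLU."""
    return np.maximum(x @ w, 0.0)

# Hypothetical sizes: fused vision+language features of dimension 8.
d = 8
w1, w2, w3 = (rng.normal(size=(d, d)) for _ in range(3))
w_parse = rng.normal(size=(d, 4))  # parsing head (e.g., 4 relation types)
w_qa = rng.normal(size=(d, 3))     # answer head (e.g., 3 answer choices)

x = rng.normal(size=(1, d))        # one fused multi-modal input

h1 = layer(x, w1)                  # lower shared layers
h2 = layer(h1, w2)
parse_logits = h2 @ w_parse        # intermediate layer -> structural parsing
h3 = layer(h2, w3)
qa_logits = h3 @ w_qa              # top layer -> cross-modal matching / QA

def softmax_xent(logits, label):
    """Cross-entropy of a single example against an integer label."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[0, label])

# Both heads share the lower layers, so gradients from parsing would
# shape the features used for QA and vice versa (mutual reinforcement).
loss = softmax_xent(parse_logits, 1) + softmax_xent(qa_logits, 0)
print(parse_logits.shape, qa_logits.shape)
```

In the real model the shared trunk is a multi-modal Transformer and both losses are minimized jointly; the sketch only shows where the two task heads attach.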
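The adversarial training of feature encoders against a domain discriminator can likewise be sketched in one dimension. The sketch below is a toy illustration of the general adversarial domain-alignment principle (in the spirit of gradient reversal), not the proposed network: the "encoder" is a single learnable shift applied to target features, the discriminator is a hand-derived logistic regression, and all constants are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D "features": source centered at 0, target centered at 2 (domain shift).
src = rng.normal(0.0, 0.5, size=200)
tgt = rng.normal(2.0, 0.5, size=200)
gap0 = abs(tgt.mean() - src.mean())

b = 0.0        # "encoder": learnable shift applied to target features
w, c = 0.0, 0.0  # domain discriminator (logistic regression, label 1 = target)
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(300):
    f_src, f_tgt = src, tgt + b
    # Discriminator step: descend the domain-classification loss.
    p_src, p_tgt = sigmoid(w * f_src + c), sigmoid(w * f_tgt + c)
    w -= lr * ((p_src * f_src).mean() + ((p_tgt - 1) * f_tgt).mean())
    c -= lr * (p_src.mean() + (p_tgt - 1).mean())
    # Encoder step with *reversed* gradient: ascend the same loss,
    # shifting target features to fool the discriminator.
    p_tgt = sigmoid(w * (tgt + b) + c)
    b += lr * ((p_tgt - 1) * w).mean()

gap = abs((tgt + b).mean() - src.mean())
print(gap < gap0)  # the source-target gap shrinks
```

The proposed network additionally conditions the discriminator on predicted ranking scores and adds a source concept classifier; those components are omitted here.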
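The rank-aware selection step of the active learning module can be illustrated with one plausible criterion: queries whose top retrieval scores are nearly tied have ambiguous rankings and are good candidates for annotation. The margin rule below is an assumption for illustration, not the paper's exact strategy; the score matrix and budget are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical retrieval scores: 6 target-domain queries x 5 gallery items.
scores = rng.normal(size=(6, 5))

def rank_margin(s):
    """Margin between the top-2 retrieval scores of each query.
    A small margin means the ranking is ambiguous, e.g., a sample
    whose concept the current model does not know well."""
    top2 = np.sort(s, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

budget = 2
margins = rank_margin(scores)
to_annotate = np.argsort(margins)[:budget]  # smallest margins first
print(sorted(to_annotate.tolist()))
```

The annotated samples would then re-enter the domain adaptation module as supervised data, which is how the two modules promote each other during joint training.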