In many application contexts, heterogeneous data can produce more desirable outcomes than single-modal data, and multimodal learning has developed rapidly in recent years. A crucial step in understanding multimodal data is therefore bridging the semantic gap between modalities. Because present models overlook the influence of adversarial attacks when establishing semantic alignment between modalities, and are susceptible to adversarial examples during the inference stage, this thesis investigates a more robust method for establishing that alignment. Meanwhile, it explores the use of prompt learning in various applications built on visual-language pre-training models.

First, this thesis proposes a robust cross-modal training strategy to address the issue that cross-modal learning is easily attacked by adversarial examples. Adversarial examples are used as one of the augmentation methods during data augmentation, and adversarial training is incorporated into the model training process. We study the adversarial robustness of this training method on image-text retrieval, a representative cross-modal retrieval task, and analyze the effect of adversarial training on each modality for both intra- and inter-modal retrieval.

Second, current cross-modal retrieval methods mostly achieve coarse-grained semantic alignment at the image and sentence levels; however, some downstream tasks require finer-grained semantic alignment. Using prompt learning and visual-language pre-training models, we investigate strategies for locating fine-grained regions in images based on text semantics. To achieve the best localization results, entities and relations are first extracted from the natural-language text; the relations are then optimized with prompt learning to produce semantically more accurate relation graphs; finally, the relations among fine-grained image regions are matched to the relations of the text modality.

Finally, this thesis investigates the use of prompt learning to defend visual-language pre-trained models against adversarial attacks. Adversarial perturbations are imperceptible to humans, yet they change samples at the semantic level relative to the originals; vector-based prompt learning can learn from adversarial examples and automatically create templates that adapt to the semantic changes these perturbations induce. The robustness of prompt learning under adversarial examples is validated in this thesis.
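The idea of using adversarial examples as an extra augmentation during training, described above, can be sketched in miniature. This is not the thesis's actual model or attack: the logistic-regression "model", the FGSM-style single-step perturbation, the toy 2-D data, and all hyperparameters (`lr`, `eps`) are illustrative assumptions chosen only to make the training loop concrete.

```python
import numpy as np

def fgsm_perturb(x, grad_x, eps):
    """FGSM-style augmentation: step the input along the sign of its loss gradient."""
    return x + eps * np.sign(grad_x)

def loss_and_grads(w, x, y):
    """Logistic loss with gradients w.r.t. both the weights and the input."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * x, (p - y) * w  # loss, grad_w, grad_x

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable toy labels

w = np.zeros(2)
lr, eps = 0.1, 0.1
for _ in range(20):
    for xi, yi in zip(X, y):
        # Step on the clean sample.
        _, gw, gx = loss_and_grads(w, xi, yi)
        w -= lr * gw
        # Craft an adversarial example and treat it as an augmented sample.
        x_adv = fgsm_perturb(xi, gx, eps)
        _, gw_adv, _ = loss_and_grads(w, x_adv, yi)
        w -= lr * gw_adv

accuracy = ((X @ w > 0).astype(float) == y).mean()
```

The same two-step pattern — take a gradient step on the clean sample, then a second step on its adversarial counterpart under the same label — is what "adversarial training as augmentation" amounts to, independent of the model's scale.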
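The claim that vector-based prompts can adapt while the pre-trained encoders stay frozen can likewise be illustrated with a minimal, CoOp-style sketch. Everything here is a toy stand-in: the linear "text encoder", the mean-pooled prompt, the fixed image feature (which could be clean or adversarial), and the dimensions are assumptions for illustration, not the thesis's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_ctx, lr = 8, 4, 0.1

# Frozen stand-ins for a pre-trained vision-language model.
W_txt = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # "text encoder" (a linear map here)
class_emb = rng.normal(size=dim)                    # embedding of the class name
img_feat = rng.normal(size=dim)                     # image feature (clean or adversarial)

# Learnable context vectors: the only trainable parameters.
ctx = rng.normal(size=(n_ctx, dim)) * 0.01

def text_feature(ctx):
    # Prompt = mean-pooled context vectors added to the class embedding,
    # then passed through the frozen encoder.
    return W_txt @ (ctx.mean(axis=0) + class_emb)

dist_init = np.linalg.norm(text_feature(ctx) - img_feat)
for _ in range(200):
    residual = text_feature(ctx) - img_feat
    grad_pooled = W_txt.T @ residual          # gradient w.r.t. the pooled prompt
    ctx -= lr * grad_pooled[None, :] / n_ctx  # update the prompt vectors only
dist_final = np.linalg.norm(text_feature(ctx) - img_feat)
```

Only `ctx` is ever updated: the "encoder" and embeddings stay fixed, so the prompt alone absorbs whatever shift the (possibly adversarial) image feature carries — which is the mechanism the defense above relies on.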