Cross-modal matching is a technique for retrieving data across different modalities; common modalities include text, images, audio, and video. Its starting point is to uncover the interrelationships between cross-modal samples, i.e., to use a query sample from one modality to search for information in another modality with a similar semantic representation. Vision-and-language cross-modal matching consists of two core tasks: image-text matching and video-text matching. The former requires retrieving the image corresponding to a given sentence and, conversely, finding a matching textual description for a query image. The latter entails searching for the video paired with a given text, as well as retrieving the annotations associated with an input video. Both enable bidirectional retrieval between the visual and textual modalities, and their common challenges are the heterogeneity gap across modalities and the semantic gap. To overcome these two long-standing challenges, this thesis proposes a solution for each of the two core tasks.

In image-text matching, existing algorithms either use global features of both modalities for coarse-grained matching or associate local image regions with words in the text for fine-grained alignment. However, they ignore the importance of multi-granularity information in the visual modality, making it difficult to correctly associate low-level visual representations with high-level semantic cues. The rich coarse-grained and fine-grained information contained in images is crucial to improving the performance of image-text matching models. Unlike image-text matching, video-text matching must also consider the relationships between video frames and the differences among text annotations of the same video. Current approaches typically aggregate video frames with single-modality mean pooling or attention mechanisms, which makes it difficult to accurately align text and video. Furthermore, large-scale pre-trained models have demonstrated strong vision-language representation capabilities, and many works have transferred them to video-text matching; however, most approaches do not fully exploit the knowledge learned by the pre-trained models and perform weakly on downstream tasks.

To address the first problem, this thesis proposes a granularity-aware semantic aggregation network that adaptively mines multi-granularity information in the visual modality and performs multi-scale local reconstruction of the input image features, retaining distinctive granularity features after removing redundancy. At the same time, the network learns multi-granularity cues from visual and textual information in a unified embedding space, using aggregation centers in that space to share semantic knowledge; this further facilitates fine-grained alignment between images and text and narrows the semantic gap between heterogeneous modalities.

To address the second problem, this thesis proposes a cross-modal reasoning network based on visual and textual prompts, which introduces a small number of learnable parameters for prompt learning in order to tap the latent cross-modal knowledge in large-scale pre-trained image-text matching models. This allows the model to better understand contextual information and further reduces the heterogeneity gap between modalities. In addition, the network treats the textual information as a condition for frame aggregation: it emphasizes the frames that are most semantically similar to the text, strengthens the correlation between text descriptions and the corresponding frames, and suppresses the remaining redundant frames, thereby reducing the semantic gap between vision and language and improving the accuracy of cross-modal matching.
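To make the idea of aggregation centers in a shared embedding space more concrete, the following is a minimal sketch, not the thesis implementation: local features from either modality (image regions or words) are softly assigned to a set of learnable centers shared by both modalities, and the assignment-weighted features are pooled into a single embedding used for matching. The class name, the number of centers, and the pooling step are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCenterAggregation(nn.Module):
    """Illustrative sketch: aggregate local features around learnable centers
    that are shared by the visual and textual branches, so both modalities are
    described in the same embedding space."""

    def __init__(self, dim: int, num_centers: int = 32):
        super().__init__()
        # Centers are shared parameters; both modalities assign to the same set.
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (batch, num_locals, dim), e.g. region or word features.
        assign = F.softmax(local_feats @ self.centers.t(), dim=-1)  # soft assignment to centers
        per_center = assign.transpose(1, 2) @ local_feats           # (batch, num_centers, dim)
        # Pool the per-center descriptors into one normalized embedding per sample.
        return F.normalize(per_center.mean(dim=1), dim=-1)
```

In such a design, image and sentence embeddings produced by the same centers can be compared directly (e.g. by cosine similarity), which is one way the shared centers support semantic knowledge sharing across modalities.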
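The prompt-learning component can likewise be illustrated with a rough sketch, assuming a frozen pre-trained encoder (e.g. a CLIP-style text or visual transformer) that accepts a sequence of token embeddings; only a small set of learnable prompt vectors prepended to the input is trained. The wrapper class, the encoder interface, and the prompt length are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Illustrative prompt tuning: keep the pre-trained encoder frozen and learn
    only a few prompt embeddings prepended to its input sequence."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # pre-trained knowledge stays fixed
        # The only trainable parameters: a handful of prompt vectors.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, token_embeds], dim=1))
```

Because only the prompt parameters receive gradients, the pre-trained image-text knowledge is preserved while the model adapts to the video-text downstream task.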
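Finally, the text-conditioned frame aggregation can be sketched as a similarity-weighted pooling over frames: frames that are semantically closer to the sentence embedding receive larger weights, while redundant frames are suppressed. The function name, the temperature value, and the cosine-similarity weighting are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_conditioned_aggregation(frame_feats: torch.Tensor,
                                 text_feat: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """frame_feats: (num_frames, dim); text_feat: (dim,). Returns a (dim,) video embedding."""
    # Cosine similarity between each frame and the sentence embedding.
    frames = F.normalize(frame_feats, dim=-1)
    text = F.normalize(text_feat, dim=-1)
    sims = frames @ text                              # (num_frames,)
    # Softmax over frames: text-relevant frames dominate, redundant frames are down-weighted.
    weights = F.softmax(sims / temperature, dim=0)
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)
```

The resulting video embedding depends on the query text, so the same video is summarized differently for different descriptions, which is the intended effect of conditioning frame aggregation on the text.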