
Cross-modal Video Retrieval Algorithm Based On Multi-semantic Clues And Metric Learning

Posted on: 2022-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: L Ding
Full Text: PDF
GTID: 2518306569497464
Subject: Computer technology
Abstract/Summary:
Because the underlying features of different modalities are heterogeneous, their similarity cannot be compared directly; this "semantic gap" makes cross-modal video retrieval a significant challenge. Most existing cross-modal video retrieval algorithms extract features from the multi-modal data and map them into a shared space, so that heterogeneous data with similar semantics lie close to each other while heterogeneous data with dissimilar semantics lie farther apart; in this way a global similarity relationship between the different modalities is established. However, these methods ignore the rich semantic clues in the data, and the features they generate perform poorly.

To address this problem, this thesis designs a cross-modal retrieval model based on multi-semantic clues. The model uses a multi-head self-attention mechanism to capture the frames that are important to the semantics of the video modality, selectively attending to the important information in the video to obtain a global feature of the video data; a Bi-GRU captures the interaction features between contexts within the video modality; and the local information in the video is mined by jointly encoding the subtle differences between local features. The model is largely symmetric, with the text feature extraction part mirroring the video part. Together, the global features, context interaction features, and local features of the video and text data form the multi-semantic clues of the multi-modal data, which mine the semantic information in the data more thoroughly and improve retrieval performance.

On this basis, to address the problem that traditional loss functions satisfy distance constraints locally but not globally, a multi-modal metric learning algorithm based on equidistance and equal distribution is proposed. By constraining the distances between same-class and different-class sample pairs to be equal, semantically similar sample pairs are drawn closer together; by guiding each class of data to keep a compact intra-class structure that follows a uniform distribution, semantically dissimilar sample pairs are pushed farther apart. This metric learning algorithm further improves retrieval performance.

Finally, extensive experiments verify the effectiveness of the model and the multi-modal metric learning algorithm. On the MSR-VTT dataset, the proposed method improves text-to-video retrieval by 5.0% over the current state-of-the-art method; on the MSVD dataset, it improves text-to-video retrieval by 11.1%. In addition, the proposed equidistance- and equal-distribution-based multi-modal metric learning algorithm yields a further 2.7% improvement over the base model.
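To make the encoder design concrete, below is a minimal PyTorch sketch of a three-branch video encoder of the kind described above. All module choices and hyper-parameters (feature dimension, number of attention heads, convolution window sizes for the local branch) are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of a multi-semantic-clue video encoder (assumed layout):
# a self-attention global branch, a Bi-GRU context branch, and a
# multi-window convolutional local branch, concatenated at the end.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        # Global branch: multi-head self-attention over frame features,
        # followed by mean pooling of the attended frames.
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads,
                                               batch_first=True)
        # Context branch: Bi-GRU captures interactions between frames.
        self.bigru = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Local branch: 1-D convolutions with several window sizes mine
        # fine-grained local differences in the Bi-GRU outputs.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden_dim, 2 * hidden_dim, k, padding=k // 2)
            for k in (2, 3, 4))

    def forward(self, frames):              # frames: (B, T, feat_dim)
        attn_out, _ = self.self_attn(frames, frames, frames)
        f_global = attn_out.mean(dim=1)     # (B, feat_dim)
        ctx, _ = self.bigru(frames)         # (B, T, 2*hidden_dim)
        f_context = ctx.mean(dim=1)         # (B, 2*hidden_dim)
        x = ctx.transpose(1, 2)             # (B, 2*hidden_dim, T)
        f_local = torch.cat(
            [torch.relu(c(x)).max(dim=2).values for c in self.convs], dim=1)
        # Concatenate the three clues into one multi-semantic embedding.
        return torch.cat([f_global, f_context, f_local], dim=1)
```

The text side would mirror this structure, with word embeddings in place of frame features, so that both modalities produce comparable multi-semantic embeddings.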
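The metric learning objective can be sketched in the same spirit. The following is one plausible reading of the "equidistance and equal distribution" constraints, not the thesis's exact loss: a bidirectional max-margin ranking term pulls matched video-text pairs together, an equidistance term pushes negative-pair similarities toward a common value, and a uniformity term spreads the embeddings over the unit hypersphere. The function name, margin, and weight lam are assumptions.

```python
# Hedged sketch of an "equidistance + equal-distribution" style metric loss.
import torch
import torch.nn.functional as F

def multimodal_metric_loss(v, t, margin=0.2, lam=0.1):
    v = F.normalize(v, dim=1)                 # (B, D) video embeddings
    t = F.normalize(t, dim=1)                 # (B, D) text embeddings
    sim = v @ t.T                             # cosine similarity matrix
    pos = sim.diag().unsqueeze(1)             # matched-pair similarities
    mask = ~torch.eye(len(v), dtype=torch.bool, device=v.device)
    # Bidirectional max-margin ranking over all in-batch negatives.
    rank = (F.relu(margin + sim - pos)[mask].mean()
            + F.relu(margin + sim.T - pos)[mask].mean())
    # Equidistance: negative similarities should cluster around their mean,
    # so no negative pair is singled out as "almost positive".
    neg = sim[mask]
    equi = ((neg - neg.mean()) ** 2).mean()
    # Equal distribution: uniformity regularizer (log of a Gaussian-kernel
    # mean) keeps embeddings spread out over the unit hypersphere.
    d2 = torch.cdist(v, v).pow(2)
    unif = torch.exp(-2.0 * d2[mask]).mean().log()
    return rank + lam * (equi + unif)
```

In training, v and t would be the batch outputs of the video and text encoders for matched video-caption pairs.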
Keywords/Search Tags: cross-modal video retrieval, multi-semantic clues, multi-head attention mechanism, distance metric loss function, multi-modal distance metric learning