
Deep Learning Based Video-Text Cross-Modal Retrieval

Posted on: 2021-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhao
Full Text: PDF
GTID: 2428330602994381
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, multimedia content has grown explosively. To help users find the content they need quickly and accurately in massive multimedia data, retrieval technology for multimedia content has attracted growing attention. Video-text cross-modal retrieval is the retrieval task between the video and text modalities: given a text query, it retrieves the corresponding video, and given a video query, it retrieves the corresponding text. The main difficulties of this task are understanding the sequential information in video and text and matching the two modalities. Based on deep learning, this thesis proposes two cross-modal video-text retrieval methods from two perspectives:

1. A stacked convolutional deep encoding network for video-text retrieval. Existing methods rarely explore long-range dependencies among video frames or textual words, which leads to insufficient textual and visual detail. This method proposes a stacked multi-scale dilated convolution module that simultaneously encodes long-range and short-range dependencies in videos and texts. The multi-scale dilated convolution (MSDC) block encodes short-range temporal cues between video frames or text words by adopting different kernel sizes and dilation rates in its convolutional layers. A stacked structure, built by repeating the MSDC block, further captures the long-range relations between these cues. Moreover, to obtain more robust textual representations, the Transformer language model is utilized in two stages: a pretraining phase and a fine-tuning phase.

2. A memory-enhanced embedding learning method for cross-modal video-text retrieval. Existing methods look for negative samples only within a mini-batch, ignoring global negative samples during training, and also ignore a peculiarity of retrieval data: one video corresponds to multiple texts. To solve these problems, this method utilizes memory modules to assist the feature encoding of video and text. Two types of memory module are proposed. The first is a cross-modal memory module, adopted for global negative mining. The second is a text center memory module, designed to record the center of the multiple textual instances of a video and thereby bridge these textual instances together.

Extensive video-text retrieval experiments on the MSR-VTT, MSVD, and VATEX datasets demonstrate the effectiveness of both methods, whose retrieval performance exceeds that of state-of-the-art algorithms.
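The MSDC idea in method 1 can be illustrated with a minimal sketch, assuming PyTorch; the kernel sizes, dilation rates, fusion layer, and block count below are hypothetical choices for illustration, not the thesis's exact configuration.

    import torch
    import torch.nn as nn

    class MSDCBlock(nn.Module):
        """Encodes short-range temporal cues at several receptive fields."""
        def __init__(self, dim, kernel_sizes=(3, 5), dilations=(1, 2)):
            super().__init__()
            self.branches = nn.ModuleList()
            for k in kernel_sizes:
                for d in dilations:
                    pad = (k - 1) * d // 2  # keep the sequence length unchanged
                    self.branches.append(
                        nn.Conv1d(dim, dim, kernel_size=k, dilation=d, padding=pad)
                    )
            # fuse the multi-scale branches back to the input dimension
            self.fuse = nn.Conv1d(dim * len(self.branches), dim, kernel_size=1)
            self.act = nn.ReLU()

        def forward(self, x):            # x: (batch, seq_len, dim)
            h = x.transpose(1, 2)        # Conv1d expects (batch, dim, seq_len)
            h = torch.cat([b(h) for b in self.branches], dim=1)
            h = self.act(self.fuse(h))
            return h.transpose(1, 2) + x  # residual connection

    class StackedMSDC(nn.Module):
        """Repeats the MSDC block so longer-range relations emerge."""
        def __init__(self, dim, num_blocks=3):
            super().__init__()
            self.blocks = nn.ModuleList([MSDCBlock(dim) for _ in range(num_blocks)])

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return x

    frames = torch.randn(8, 32, 512)   # 8 videos, 32 frame features each
    encoder = StackedMSDC(dim=512)
    print(encoder(frames).shape)       # torch.Size([8, 32, 512])

Each block mixes several kernel sizes and dilation rates, and stacking blocks widens the overall receptive field, which is how short-range cues and long-range relations can be captured by the same module.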
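The two memory modules of method 2 can be sketched in the same spirit. This is a minimal sketch, again assuming PyTorch; the queue size, momentum coefficient, hardest-negative selection rule, and loss weighting are illustrative assumptions rather than the thesis's exact design (the keyword list also mentions a momentum encoder, which the momentum-style center update only loosely mirrors).

    import torch
    import torch.nn.functional as F

    class CrossModalMemory:
        """FIFO bank of past text embeddings, used as global negatives."""
        def __init__(self, dim, size=4096):
            self.bank = F.normalize(torch.randn(size, dim), dim=1)
            self.ptr = 0

        @torch.no_grad()
        def enqueue(self, text_emb):             # text_emb: (batch, dim)
            n = text_emb.size(0)
            idx = torch.arange(self.ptr, self.ptr + n) % self.bank.size(0)
            self.bank[idx] = F.normalize(text_emb, dim=1)
            self.ptr = (self.ptr + n) % self.bank.size(0)

        def hardest_negatives(self, video_emb):  # video_emb: (batch, dim)
            sim = F.normalize(video_emb, dim=1) @ self.bank.t()
            return self.bank[sim.argmax(dim=1)]  # most confusing bank entries

    class TextCenterMemory:
        """Running center of the several captions attached to one video."""
        def __init__(self, num_videos, dim, momentum=0.9):
            self.centers = torch.zeros(num_videos, dim)
            self.m = momentum

        @torch.no_grad()
        def update(self, video_ids, text_emb):
            for vid, emb in zip(video_ids.tolist(), text_emb):
                self.centers[vid] = self.m * self.centers[vid] + (1 - self.m) * emb

        def center_loss(self, video_ids, text_emb):
            # pull each caption embedding toward its video's caption center
            return F.mse_loss(text_emb, self.centers[video_ids])

    # toy usage: a triplet loss over global negatives plus the center loss
    video, text = torch.randn(8, 512), torch.randn(8, 512)
    video_ids = torch.arange(8)
    memory = CrossModalMemory(dim=512)
    centers = TextCenterMemory(num_videos=100, dim=512)

    neg = memory.hardest_negatives(video)
    loss = F.triplet_margin_loss(F.normalize(video, dim=1),
                                 F.normalize(text, dim=1), neg, margin=0.2)
    loss = loss + 0.1 * centers.center_loss(video_ids, text)
    memory.enqueue(text.detach())
    centers.update(video_ids, F.normalize(text, dim=1).detach())

The bank supplies negatives drawn from the whole training history rather than the current mini-batch, while the center memory ties the multiple captions of one video to a shared anchor, matching the two roles the abstract describes.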
Keywords/Search Tags:cross-modal retrieval, embedding learning, convolutional network, Transformer, memory module, momentum encoder