
Deep Learning Based Video-Text Cross-Modal Retrieval

Posted on: 2021-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhao
Full Text: PDF
GTID: 2428330602994381
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, multimedia content has grown explosively. To help users find the content they need quickly and accurately in massive multimedia data, retrieval technology for multimedia content has attracted growing attention. Video-text cross-modal retrieval is the retrieval task between the video and text modalities: given a text query, it retrieves the corresponding video, and given a video query, it retrieves the corresponding text. The main difficulties of this task are understanding the sequential information in video and text and matching the two modalities. Based on deep learning, this thesis proposes two cross-modal video-text retrieval methods from two perspectives:

1. A stacked convolutional deep encoding network for video-text retrieval. Existing methods rarely explore long-range dependencies among video frames or textual words, which leads to insufficient textual and visual detail. This method proposes a stacked multi-scale dilated convolution module that simultaneously encodes long-range and short-range dependencies in videos and texts. The multi-scale dilated convolution (MSDC) block encodes short-range temporal cues between video frames or text words by adopting different kernel sizes and dilation rates in its convolutional layers. A stacked structure, built by repeating the MSDC block, further captures the long-range relations between these cues. Moreover, to obtain more robust textual representations, the Transformer language model is utilized in two stages: a pretraining phase and a fine-tuning phase.

2. A memory-enhanced embedding learning method for cross-modal video-text retrieval. Existing methods look for negative samples only within a mini-batch, ignoring global negative samples during training, and also ignore a peculiarity of retrieval data: one video corresponds to multiple texts. To solve these problems, this method utilizes memory modules to assist the feature encoding of video and text. Two types of memory module are proposed. The first is a cross-modal memory module, adopted for global negative mining. The second is a text center memory module, designed to record the center of the multiple textual instances of a video and thereby bridge these textual instances together.

Extensive video-text retrieval experiments on the MSR-VTT, MSVD, and VATEX datasets demonstrate the effectiveness of both methods, whose retrieval performance exceeds that of state-of-the-art algorithms.
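The MSDC idea in method 1 can be illustrated with a minimal sketch, assuming PyTorch; the kernel sizes, dilation rates, fusion layer, and block count below are hypothetical choices for illustration, not the thesis's exact configuration.

    import torch
    import torch.nn as nn

    class MSDCBlock(nn.Module):
        """Encodes short-range temporal cues at several receptive fields."""
        def __init__(self, dim, kernel_sizes=(3, 5), dilations=(1, 2)):
            super().__init__()
            self.branches = nn.ModuleList()
            for k in kernel_sizes:
                for d in dilations:
                    pad = (k - 1) * d // 2  # keep the sequence length unchanged
                    self.branches.append(
                        nn.Conv1d(dim, dim, kernel_size=k, dilation=d, padding=pad)
                    )
            # fuse the multi-scale branches back to the input dimension
            self.fuse = nn.Conv1d(dim * len(self.branches), dim, kernel_size=1)
            self.act = nn.ReLU()

        def forward(self, x):            # x: (batch, seq_len, dim)
            h = x.transpose(1, 2)        # Conv1d expects (batch, dim, seq_len)
            h = torch.cat([b(h) for b in self.branches], dim=1)
            h = self.act(self.fuse(h))
            return h.transpose(1, 2) + x  # residual connection

    class StackedMSDC(nn.Module):
        """Repeats the MSDC block so longer-range relations emerge."""
        def __init__(self, dim, num_blocks=3):
            super().__init__()
            self.blocks = nn.ModuleList([MSDCBlock(dim) for _ in range(num_blocks)])

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return x

    frames = torch.randn(8, 32, 512)   # 8 videos, 32 frame features each
    encoder = StackedMSDC(dim=512)
    print(encoder(frames).shape)       # torch.Size([8, 32, 512])

Each block mixes several kernel sizes and dilation rates, and stacking blocks widens the overall receptive field, which is how short-range cues and long-range relations can be captured by the same module.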
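The two memory modules of method 2 can be sketched in the same spirit. This is a minimal sketch, again assuming PyTorch; the queue size, momentum coefficient, hardest-negative selection rule, and loss weighting are illustrative assumptions rather than the thesis's exact design (the keyword list also mentions a momentum encoder, which the momentum-style center update only loosely mirrors).

    import torch
    import torch.nn.functional as F

    class CrossModalMemory:
        """FIFO bank of past text embeddings, used as global negatives."""
        def __init__(self, dim, size=4096):
            self.bank = F.normalize(torch.randn(size, dim), dim=1)
            self.ptr = 0

        @torch.no_grad()
        def enqueue(self, text_emb):             # text_emb: (batch, dim)
            n = text_emb.size(0)
            idx = torch.arange(self.ptr, self.ptr + n) % self.bank.size(0)
            self.bank[idx] = F.normalize(text_emb, dim=1)
            self.ptr = (self.ptr + n) % self.bank.size(0)

        def hardest_negatives(self, video_emb):  # video_emb: (batch, dim)
            sim = F.normalize(video_emb, dim=1) @ self.bank.t()
            return self.bank[sim.argmax(dim=1)]  # most confusing bank entries

    class TextCenterMemory:
        """Running center of the several captions attached to one video."""
        def __init__(self, num_videos, dim, momentum=0.9):
            self.centers = torch.zeros(num_videos, dim)
            self.m = momentum

        @torch.no_grad()
        def update(self, video_ids, text_emb):
            for vid, emb in zip(video_ids.tolist(), text_emb):
                self.centers[vid] = self.m * self.centers[vid] + (1 - self.m) * emb

        def center_loss(self, video_ids, text_emb):
            # pull each caption embedding toward its video's caption center
            return F.mse_loss(text_emb, self.centers[video_ids])

    # toy usage: a triplet loss over global negatives plus the center loss
    video, text = torch.randn(8, 512), torch.randn(8, 512)
    video_ids = torch.arange(8)
    memory = CrossModalMemory(dim=512)
    centers = TextCenterMemory(num_videos=100, dim=512)

    neg = memory.hardest_negatives(video)
    loss = F.triplet_margin_loss(F.normalize(video, dim=1),
                                 F.normalize(text, dim=1), neg, margin=0.2)
    loss = loss + 0.1 * centers.center_loss(video_ids, text)
    memory.enqueue(text.detach())
    centers.update(video_ids, F.normalize(text, dim=1).detach())

The bank supplies negatives drawn from the whole training history rather than the current mini-batch, while the center memory ties the multiple captions of one video to a shared anchor, matching the two roles the abstract describes.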
Keywords/Search Tags:cross-modal retrieval, embedding learning, convolutional network, Transformer, memory module, momentum encoder