
Research On Video Text Retrieval Algorithm Based On Relational Network

Posted on: 2022-12-18    Degree: Master    Type: Thesis
Country: China    Candidate: N Wang    Full Text: PDF
GTID: 2518306764966889    Subject: Computer Software and Application of Computer
Abstract/Summary:
The video-text retrieval task requires the user to input a query text (or video) and retrieve the most semantically similar video (or text). Most current video-text retrieval methods are adapted from image-text retrieval, but images contain only spatial information and no temporal information, so most transferred methods fail to model the temporal information in videos. A few methods do use convolutional neural networks or recurrent neural networks to reason about temporal relations in videos, but the results are unsatisfactory, especially when the video contains spatial transformations, background changes, or actions. This thesis therefore focuses on video temporal information and proposes an attention-based relation reasoning network (ARRN). ARRN can learn and reason about multi-scale relations between the words of a sentence and multi-scale temporal relations between video frames. In addition, this thesis holds that accurate representation of single-modal information is the basis of multi-modal tasks. To this end, it designs a global-to-local attention mechanism that jointly captures the local and global features of a video (or text) and then fuses them, which significantly improves the feature representation capability of each single modality. Finally, when feature similarity between different modalities must be measured in the retrieval task, traditional loss functions struggle to do so accurately because of the semantic gap. This thesis therefore designs a projection matching loss (PML). With PML, the model can further align the two feature distributions and learn a more effective common subspace. The model is tested on multiple datasets and achieves significant improvements.

With the advent of the data era and the rapid growth of computing power, more and more multi-modal networks use massive data and pre-training strategies to improve model performance. Researchers have also transferred image-text pre-training models to the video-text field, and this transfer again leads models to pay insufficient attention to the temporal information in videos. In addition, existing large models mostly use the Transformer as the backbone network, but the Transformer itself has limited ability to model local information. This thesis therefore proposes the multi-scale temporal difference transformer (MSTDT), which aims to improve the Transformer's ability to model local relations. To make the model attend to video temporal information, this thesis introduces a temporal difference feature, which mainly describes the fine-grained temporal information of the video. MSTDT can learn multi-scale temporal relations and fine-grained information in the video, and thereby understand fine movements and complex scene changes. Furthermore, to align features within a single modality and better handle the cross-modal semantic gap, this thesis also proposes a binary similarity loss (BSL). CLIP is used as the backbone network, and MSTDT is inserted into the video modeling module of CLIP as a temporal relation network for modeling video. Combined with BSL, the final model achieves a large improvement.
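To illustrate the temporal difference idea described above, the sketch below computes frame-to-frame feature differences at several temporal scales and pools them into a single video descriptor. This is a minimal sketch only: the function name, the choice of scales, and the pooling scheme are assumptions made for illustration, not the actual MSTDT design from the thesis.

```python
import torch

def multi_scale_temporal_difference(frame_feats: torch.Tensor,
                                    scales=(1, 2, 4)) -> torch.Tensor:
    """Illustrative multi-scale temporal difference features (hypothetical).

    frame_feats: (T, D) tensor of per-frame embeddings.
    For each scale s, the difference f[t + s] - f[t] captures motion or
    scene change over a span of s frames; differences from all scales are
    mean-pooled and concatenated with the mean frame feature.
    """
    T, D = frame_feats.shape
    pooled = [frame_feats.mean(dim=0)]                   # global (static) content
    for s in scales:
        if T > s:
            diff = frame_feats[s:] - frame_feats[:-s]    # (T - s, D)
            pooled.append(diff.mean(dim=0))              # fine-grained motion at scale s
        else:
            pooled.append(frame_feats.new_zeros(D))      # too few frames for this scale
    return torch.cat(pooled, dim=-1)                     # (D * (1 + len(scales)),)

# Example: 12 frames with 512-dim features
video = multi_scale_temporal_difference(torch.randn(12, 512))
print(video.shape)  # torch.Size([2048])
```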
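The idea of inserting MSTDT into CLIP's video modeling module can likewise be pictured as a small temporal relation module applied to per-frame embeddings before pooling, with video-text similarity measured against the text embedding. The sketch below uses a generic Transformer encoder as a stand-in for MSTDT; the class name, layer sizes, and the random tensors standing in for real CLIP features are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalRelationHead(nn.Module):
    """Illustrative stand-in for a temporal relation module (e.g. MSTDT)
    placed on top of per-frame CLIP embeddings. A small Transformer
    encoder models relations between frames before mean pooling."""

    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):            # (B, T, D) per-frame embeddings
        temporal = self.encoder(frame_feats)   # model relations between frames
        return temporal.mean(dim=1)            # (B, D) video embedding

# Hypothetical usage: frame_feats / text_feats would come from CLIP's image
# and text encoders; random tensors stand in here for a runnable example.
frame_feats = torch.randn(4, 12, 512)          # 4 videos, 12 frames each
text_feats = torch.randn(4, 512)

video_feats = TemporalRelationHead()(frame_feats)
sim = F.cosine_similarity(video_feats, text_feats, dim=-1)  # retrieval scores
print(sim.shape)  # torch.Size([4])
```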
Keywords/Search Tags: Multi-modality, Video-text retrieval, Relational network, Feature alignment, Heterogeneous gap