
Research On Video Text Retrieval Algorithm Based On Relational Network

Posted on: 2022-12-18    Degree: Master    Type: Thesis
Country: China    Candidate: N Wang    Full Text: PDF
GTID: 2518306764966889    Subject: Computer Software and Application of Computer
Abstract/Summary:
The video-text retrieval task requires the user to input a query text (or video) and retrieve the most semantically similar video (or text). Most current video-text retrieval methods are adapted from image-text retrieval, but images contain only spatial information and no temporal information, so most transferred methods fail to model the temporal information in videos. A few methods do use convolutional neural networks or recurrent neural networks to reason about temporal relations in videos, but the results are unsatisfactory, especially when the video contains spatial transformations, background changes, or actions. This thesis therefore focuses on video temporal information and proposes an attention-based relation reasoning network (ARRN). ARRN can learn and reason about multi-scale relations between the words of a sentence and multi-scale temporal relations between video frames. In addition, this thesis holds that accurate representation of single-modal information is the basis of multi-modal tasks. To this end, it designs a global-to-local attention mechanism that jointly captures the local and global features of a video (or text) and then fuses them, which significantly improves the feature representation capability of each single modality. Finally, when feature similarity between different modalities must be measured in the retrieval task, traditional loss functions struggle to do so accurately because of the semantic gap. This thesis therefore designs a projection matching loss (PML). With PML, the model can further align the two feature distributions and learn a more effective common subspace. The model is tested on multiple datasets and achieves significant improvements.

With the advent of the data era and the rapid growth of computing power, more and more multi-modal networks use massive data and pre-training strategies to improve model performance. Researchers have also transferred image-text pre-training models to the video-text field, and this transfer again leads models to pay insufficient attention to the temporal information in videos. In addition, existing large models mostly use the Transformer as the backbone network, but the Transformer itself has limited ability to model local information. This thesis therefore proposes the multi-scale temporal difference transformer (MSTDT), which aims to improve the Transformer's ability to model local relations. To make the model attend to video temporal information, this thesis introduces a temporal difference feature, which mainly describes the fine-grained temporal information of the video. MSTDT can learn multi-scale temporal relations and fine-grained information in the video, and thereby understand fine movements and complex scene changes. Furthermore, to align features within a single modality and better handle the cross-modal semantic gap, this thesis also proposes a binary similarity loss (BSL). CLIP is used as the backbone network, and MSTDT is inserted into the video modeling module of CLIP as a temporal relation network for modeling video. Combined with BSL, the final model achieves a large improvement.
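To illustrate the temporal difference idea described above, the sketch below computes frame-to-frame feature differences at several temporal scales and pools them into a single video descriptor. This is a minimal sketch only: the function name, the choice of scales, and the pooling scheme are assumptions made for illustration, not the actual MSTDT design from the thesis.

```python
import torch

def multi_scale_temporal_difference(frame_feats: torch.Tensor,
                                    scales=(1, 2, 4)) -> torch.Tensor:
    """Illustrative multi-scale temporal difference features (hypothetical).

    frame_feats: (T, D) tensor of per-frame embeddings.
    For each scale s, the difference f[t + s] - f[t] captures motion or
    scene change over a span of s frames; differences from all scales are
    mean-pooled and concatenated with the mean frame feature.
    """
    T, D = frame_feats.shape
    pooled = [frame_feats.mean(dim=0)]                   # global (static) content
    for s in scales:
        if T > s:
            diff = frame_feats[s:] - frame_feats[:-s]    # (T - s, D)
            pooled.append(diff.mean(dim=0))              # fine-grained motion at scale s
        else:
            pooled.append(frame_feats.new_zeros(D))      # too few frames for this scale
    return torch.cat(pooled, dim=-1)                     # (D * (1 + len(scales)),)

# Example: 12 frames with 512-dim features
video = multi_scale_temporal_difference(torch.randn(12, 512))
print(video.shape)  # torch.Size([2048])
```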
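The idea of inserting MSTDT into CLIP's video modeling module can likewise be pictured as a small temporal relation module applied to per-frame embeddings before pooling, with video-text similarity measured against the text embedding. The sketch below uses a generic Transformer encoder as a stand-in for MSTDT; the class name, layer sizes, and the random tensors standing in for real CLIP features are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalRelationHead(nn.Module):
    """Illustrative stand-in for a temporal relation module (e.g. MSTDT)
    placed on top of per-frame CLIP embeddings. A small Transformer
    encoder models relations between frames before mean pooling."""

    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):            # (B, T, D) per-frame embeddings
        temporal = self.encoder(frame_feats)   # model relations between frames
        return temporal.mean(dim=1)            # (B, D) video embedding

# Hypothetical usage: frame_feats / text_feats would come from CLIP's image
# and text encoders; random tensors stand in here for a runnable example.
frame_feats = torch.randn(4, 12, 512)          # 4 videos, 12 frames each
text_feats = torch.randn(4, 512)

video_feats = TemporalRelationHead()(frame_feats)
sim = F.cosine_similarity(video_feats, text_feats, dim=-1)  # retrieval scores
print(sim.shape)  # torch.Size([4])
```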
Keywords/Search Tags: Multi-modality, Video-text retrieval, Relational network, Feature alignment, Heterogeneous gap