
Research of Video Temporal Activity Retrieval Based on Attention Networks

Posted on: 2021-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Huang
Full Text: PDF
GTID: 2518306122474704
Subject: Computer Science and Technology
Abstract/Summary:
With the proliferation of the Internet, sensor-rich mobile devices, social media, and other technologies, users can create and share multimedia data (e.g., text, images, videos) anywhere and at any time. Given the large volume of multimedia data generated, efficiently and accurately retrieving the data that users need or are interested in is a problem of real practical value. Searching videos of interest from large collections based on user-entered keywords has become a research hotspot in multimedia information retrieval. However, when users want to localize a desired temporal moment or a related event within an untrimmed long video, new methods and frameworks are required, which has given rise to video moment retrieval. Video temporal moment retrieval aims to identify the specific start and end time points within a video in response to a given description query. The attention mechanism, which extracts content related to a target from source sequences, is widely used in cross-modal retrieval models to extract target-related textual or visual features and significantly improves model performance. Therefore, this paper studies cross-modal video moment retrieval based on attention networks. The main contributions are as follows:

(1) This paper presents a language-temporal attention network. It uses the attention mechanism to adaptively assign weights to the keywords in the given query based on the temporal moment contexts in the video, so that complex and salient query information can be adaptively encoded for localizing the desired moment. The textual features, the visual features of the video segment, and the moment context information are then jointly modeled to enhance the interactions between multimodal features, and a temporal regression localizer identifies the specific start and end time points of the desired video moment.

(2) This paper presents a cross-modal retrieval method based on spatial and language-temporal attention. The model comprises two attention sub-networks, spatial attention and language-temporal attention, which recognize the most relevant regions in the video and simultaneously highlight the keywords in the query. The visual features of the video segment and the textual features are then jointly modeled, and a temporal regression localizer identifies the specific start and end time points of the desired video moment.

(3) This paper presents a tensor-fusion-based cross-modal retrieval method. The model introduces a tensor fusion module that applies mean pooling to the visual features of the temporal moment contexts and to the textual features. A tensor fusion network then captures the intra-modal and inter-modal embedding interactions, strengthening the fusion of data from different modalities.

All the proposed methods are evaluated on three benchmark datasets: TACoS, Charades-STA, and DiDeMo. The language-temporal attention method outperforms CTRL by 0.24% to 2.49%; the spatial and language-temporal attention method is a further 0.73% to 2.67% higher; and adding the tensor fusion network brings an additional improvement of 0.3% to 2.26%.
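To make the mechanism in (1) concrete, the following is a minimal PyTorch sketch of how query keywords might be weighted by the moment's temporal context and combined with the segment's visual features ahead of a regression localizer. All layer names, dimensions, and the exact fusion scheme are illustrative assumptions, not the implementation used in this work.

```python
# Minimal sketch of language-temporal attention (assumed design, for illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageTemporalAttention(nn.Module):
    def __init__(self, word_dim=300, visual_dim=500, hidden_dim=256):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.ctx_proj = nn.Linear(visual_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.fuse = nn.Linear(hidden_dim + visual_dim, hidden_dim)
        # Regression head: alignment score plus start/end offsets.
        self.localizer = nn.Linear(hidden_dim, 3)

    def forward(self, word_feats, moment_ctx):
        # word_feats: (B, num_words, word_dim); moment_ctx: (B, visual_dim)
        w = self.word_proj(word_feats)                        # (B, T, H)
        c = self.ctx_proj(moment_ctx).unsqueeze(1)            # (B, 1, H)
        scores = self.score(torch.tanh(w + c)).squeeze(-1)    # (B, T)
        alpha = F.softmax(scores, dim=-1)                     # per-keyword weights
        query = torch.bmm(alpha.unsqueeze(1), w).squeeze(1)   # (B, H) weighted query
        fused = torch.relu(self.fuse(torch.cat([query, moment_ctx], dim=-1)))
        out = self.localizer(fused)                           # (B, 3): score, dstart, dend
        return alpha, out
```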
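Similarly, a minimal sketch of the tensor fusion idea in (3): the visual features of the moment contexts are mean-pooled and combined with the textual features through an outer product, so that both intra-modal and inter-modal terms appear in the fused representation. The appended constant, the dimensions, and the post-fusion layer are assumptions made for illustration.

```python
# Minimal sketch of a tensor fusion module (assumed design, for illustration).
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    def __init__(self, visual_dim=500, text_dim=256, hidden_dim=256):
        super().__init__()
        # The +1 accounts for a constant 1 appended to each modality, which keeps
        # the unimodal (intra-modal) terms alongside the bimodal interactions.
        self.post_fusion = nn.Linear((visual_dim + 1) * (text_dim + 1), hidden_dim)

    def forward(self, clip_ctx_feats, text_feat):
        # clip_ctx_feats: (B, num_ctx_clips, visual_dim) visual features of the
        # moment and its temporal contexts; text_feat: (B, text_dim).
        v = clip_ctx_feats.mean(dim=1)                        # mean pooling over contexts
        ones = v.new_ones(v.size(0), 1)
        v = torch.cat([v, ones], dim=-1)                      # (B, Dv + 1)
        t = torch.cat([text_feat, ones], dim=-1)              # (B, Dt + 1)
        outer = torch.bmm(v.unsqueeze(2), t.unsqueeze(1))     # (B, Dv + 1, Dt + 1)
        fused = self.post_fusion(outer.flatten(1))            # (B, hidden_dim)
        return torch.relu(fused)
```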
Keywords/Search Tags:cross-modal retrieval, moment localization, spatial attention network, language-temporal attention network, tensor fusion network