
Research of Video Temporal Activity Retrieval Based on Attention Networks

Posted on: 2021-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Huang
Full Text: PDF
GTID: 2518306122474704
Subject: Computer Science and Technology
Abstract/Summary:
With the proliferation of the Internet, sensor-rich mobile devices, social media, and other technologies, users can create and share multimedia data (e.g., text, images, videos) anywhere and at any time. Given the large volume of multimedia data generated, efficiently and accurately retrieving the data that users need or are interested in is a problem of real practical value. Searching videos of interest from large collections based on user-entered keywords has become a research hotspot in multimedia information retrieval. However, when users want to localize a desired temporal moment or a related event within an untrimmed long video, new methods and frameworks are required, which has given rise to video moment retrieval. Video temporal moment retrieval aims to identify the specific start and end time points within a video in response to a given description query. The attention mechanism, which extracts content related to a target from source sequences, is widely used in cross-modal retrieval models to extract target-related textual or visual features and significantly improves model performance. Therefore, this paper studies cross-modal video moment retrieval based on attention networks. The main contributions are as follows:

(1) This paper presents a language-temporal attention network. It uses the attention mechanism to adaptively assign weights to the keywords in the given query based on the temporal moment contexts in the video, so that complex and salient query information can be adaptively encoded for localizing the desired moment. The textual features, the visual features of the video segment, and the moment context information are then jointly modeled to enhance the interactions between multimodal features, and a temporal regression localizer identifies the specific start and end time points of the desired video moment.

(2) This paper presents a cross-modal retrieval method based on spatial and language-temporal attention. The model comprises two attention sub-networks, spatial attention and language-temporal attention, which recognize the most relevant regions in the video and simultaneously highlight the keywords in the query. The visual features of the video segment and the textual features are then jointly modeled, and a temporal regression localizer identifies the specific start and end time points of the desired video moment.

(3) This paper presents a tensor-fusion-based cross-modal retrieval method. The model introduces a tensor fusion module that applies mean pooling to the visual features of the temporal moment contexts and to the textual features. A tensor fusion network then captures the intra-modal and inter-modal embedding interactions, strengthening the fusion of data from different modalities.

All the proposed methods are evaluated on three benchmark datasets: TACoS, Charades-STA, and DiDeMo. The language-temporal attention method outperforms CTRL by 0.24% to 2.49%; the spatial and language-temporal attention method is a further 0.73% to 2.67% higher; and adding the tensor fusion network brings an additional improvement of 0.3% to 2.26%.
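To make the mechanism in (1) concrete, the following is a minimal PyTorch sketch of how query keywords might be weighted by the moment's temporal context and combined with the segment's visual features ahead of a regression localizer. All layer names, dimensions, and the exact fusion scheme are illustrative assumptions, not the implementation used in this work.

```python
# Minimal sketch of language-temporal attention (assumed design, for illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageTemporalAttention(nn.Module):
    def __init__(self, word_dim=300, visual_dim=500, hidden_dim=256):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.ctx_proj = nn.Linear(visual_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.fuse = nn.Linear(hidden_dim + visual_dim, hidden_dim)
        # Regression head: alignment score plus start/end offsets.
        self.localizer = nn.Linear(hidden_dim, 3)

    def forward(self, word_feats, moment_ctx):
        # word_feats: (B, num_words, word_dim); moment_ctx: (B, visual_dim)
        w = self.word_proj(word_feats)                        # (B, T, H)
        c = self.ctx_proj(moment_ctx).unsqueeze(1)            # (B, 1, H)
        scores = self.score(torch.tanh(w + c)).squeeze(-1)    # (B, T)
        alpha = F.softmax(scores, dim=-1)                     # per-keyword weights
        query = torch.bmm(alpha.unsqueeze(1), w).squeeze(1)   # (B, H) weighted query
        fused = torch.relu(self.fuse(torch.cat([query, moment_ctx], dim=-1)))
        out = self.localizer(fused)                           # (B, 3): score, dstart, dend
        return alpha, out
```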
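Similarly, a minimal sketch of the tensor fusion idea in (3): the visual features of the moment contexts are mean-pooled and combined with the textual features through an outer product, so that both intra-modal and inter-modal terms appear in the fused representation. The appended constant, the dimensions, and the post-fusion layer are assumptions made for illustration.

```python
# Minimal sketch of a tensor fusion module (assumed design, for illustration).
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    def __init__(self, visual_dim=500, text_dim=256, hidden_dim=256):
        super().__init__()
        # The +1 accounts for a constant 1 appended to each modality, which keeps
        # the unimodal (intra-modal) terms alongside the bimodal interactions.
        self.post_fusion = nn.Linear((visual_dim + 1) * (text_dim + 1), hidden_dim)

    def forward(self, clip_ctx_feats, text_feat):
        # clip_ctx_feats: (B, num_ctx_clips, visual_dim) visual features of the
        # moment and its temporal contexts; text_feat: (B, text_dim).
        v = clip_ctx_feats.mean(dim=1)                        # mean pooling over contexts
        ones = v.new_ones(v.size(0), 1)
        v = torch.cat([v, ones], dim=-1)                      # (B, Dv + 1)
        t = torch.cat([text_feat, ones], dim=-1)              # (B, Dt + 1)
        outer = torch.bmm(v.unsqueeze(2), t.unsqueeze(1))     # (B, Dv + 1, Dt + 1)
        fused = self.post_fusion(outer.flatten(1))            # (B, hidden_dim)
        return torch.relu(fused)
```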
Keywords/Search Tags:cross-modal retrieval, moment localization, spatial attention network, language-temporal attention network, tensor fusion network