Hybrid-Attention Enhanced Two-Stream Fusion Network For Video Venue Category Prediction

Posted on: 2022-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Zhang
Full Text: PDF
GTID: 2518306608481064
Subject: Automation Technology
Abstract/Summary:
Video venue category prediction has been drawing increasing attention in the multimedia community for applications such as personalized restaurant recommendation, tourist route planning, and video place verification. Thanks to the spread of portable devices such as mobile phones and tablets, more and more users record their daily lives on video and upload the footage to social platforms for sharing. However, most users upload videos without venue annotations to protect their privacy, which hinders progress in video venue category prediction.

Most existing works resort to information from either multiple modalities or other platforms to strengthen the video representation. However, noisy acoustic signals, sparse textual descriptions, and incompatible cross-platform data can limit the performance gain and reduce the generality of the model. Unlike these works, we focus on extracting discriminative visual features from videos by introducing a hybrid-attention mechanism, temporal components, and a two-stream network.

In particular, we design a novel Global-Local Attention Module (GLAM) to extract complementary content information. The GLAM consists of a Global Attention (GA), a Local Attention (LA), and a convolution layer. The GA captures contextual scene-oriented information and its layout by assigning different weights to channels, while the LA learns salient object-oriented features by allocating different weights to spatial regions. The GLAM can be extended with multiple GAs and LAs for further visual enhancement, and it can be inserted into various neural networks to generate enhanced visual features. The two types of features captured by the GAs and LAs are integrated via convolution layers and then fed into a convolutional Long Short-Term Memory (ConvLSTM) network to generate discriminative spatio-temporal representations, constituting the content stream.

In addition, video motion is explored to learn long-term movement variations, since these also contribute to venue category prediction. Based on the content and motion information, we propose a Hybrid-Attention Enhanced Two-Stream Fusion Network (HA-TSFN), which comprises a content stream and a motion stream. The content stream extracts content information from videos, such as scene-oriented and object-oriented features, while the motion stream captures behavior cues. Finally, HA-TSFN merges the features from the two streams into a complementary representation.

We conduct extensive experiments on HA-TSFN to verify its effectiveness and generalization ability. The results demonstrate that our method achieves state-of-the-art prediction performance on the large-scale Vine dataset. Visualizations further show that the proposed GLAM captures complementary scene-oriented and object-oriented visual features from videos, which supports the interpretability of the proposed model.
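For concreteness, the following is a minimal PyTorch sketch of the global-local attention idea described above: a channel-wise global attention branch (GA) that re-weights channels from globally pooled context, a spatial local attention branch (LA) that re-weights spatial positions, and a 1x1 convolution that fuses the two branches. All class names, the reduction ratio, and the kernel size are illustrative assumptions, not the thesis implementation.

import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    # Channel-wise (scene-oriented) attention: squeeze global context
    # with average pooling, then re-weight each channel.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # B x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)                           # broadcast over H x W

class LocalAttention(nn.Module):
    # Spatial (object-oriented) attention: pool across channels,
    # then re-weight each spatial position.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                 # B x 1 x H x W
        mx, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                                   # broadcast over C

class GLAM(nn.Module):
    # Global-Local Attention Module: run GA and LA in parallel and
    # fuse the two enhanced feature maps with a 1x1 convolution.
    def __init__(self, channels):
        super().__init__()
        self.ga = GlobalAttention(channels)
        self.la = LocalAttention()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.ga(x), self.la(x)], dim=1))

# Example: enhance a 64-channel feature map from any backbone.
# feats = torch.randn(2, 64, 28, 28)
# out = GLAM(64)(feats)                                   # 2 x 64 x 28 x 28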
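In the same spirit, a two-stream fusion skeleton under stated assumptions: content_net stands in for the GLAM-enhanced backbone followed by the ConvLSTM that summarizes frame features into a single vector, motion_net for the motion stream, and late fusion by feature concatenation is one plausible reading of the merge step; none of these names come from the thesis.

class HATSFN(nn.Module):
    # Two-stream fusion skeleton: a content stream and a motion stream,
    # each mapping its input to a feature vector, merged by concatenation
    # before a linear venue-category classifier.
    def __init__(self, content_net, motion_net,
                 content_dim, motion_dim, num_classes):
        super().__init__()
        self.content_net = content_net   # frames -> (B, content_dim)
        self.motion_net = motion_net     # motion -> (B, motion_dim)
        self.classifier = nn.Linear(content_dim + motion_dim, num_classes)

    def forward(self, frames, motion):
        f_c = self.content_net(frames)
        f_m = self.motion_net(motion)
        return self.classifier(torch.cat([f_c, f_m], dim=1))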
Keywords/Search Tags:Attention Mechanism, Feature Reinforcement, Venue Category Prediction, Deep Neural Network