
Action Recognition Based On Interactions

Posted on: 2021-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Xia
Full Text: PDF
GTID: 2518306503480694
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the great success of deep networks on a variety of image tasks, more and more research has focused on the more complex task of video understanding. Action recognition, one of the most important video understanding tasks, aims to localize every person in a video in space and time and to identify their actions. The topic has great value in both academia and industry: action recognition can be widely applied to surveillance cameras, autonomous driving, platform video review and commercialization, human behavior research, and so on.

This thesis studies action recognition based on a variety of interactions in the video, where an interaction refers to the relationship between people and their environment. We observe three types of interactions that help recognize actions: person-person interaction, person-object interaction, and temporal interaction. To model these interactions, we first use deep video networks together with object detection models to extract features of the people and objects that appear in the video. On top of these extracted features, we propose a general interaction module based on the dot-product attention mechanism to model each of the three interactions. To fuse the three types of interactions, we propose a serial reasoning structure in which the interaction blocks are connected in series; as the interaction network deepens, the human action features are continuously strengthened, and the fused model can then capture complex interactions.

Long-term temporal interactions are important but complicated for action recognition, and previous algorithms that model them consume a large amount of computing resources. To address this, we propose a feature pool together with a dynamic read-write algorithm. The feature pool stores the action features of the video over a long period of time; during training, the model uses the dynamic read-write algorithm to read and update the features in the pool. This lets us store features that are temporally far away while avoiding direct convolution over the entire video, yielding a more efficient model with better results.

The proposed model is evaluated on the large-scale Atomic Visual Actions (AVA) dataset, currently the largest and most discriminative dataset of its kind. We set up multiple groups of experiments to verify the computational and accuracy advantages of each proposed module. At the same or even lower computational cost, a single model of ours outperforms other state-of-the-art methods by at least 5 mAP. Our interaction-based action recognition model thus achieves a new state-of-the-art performance.
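To make the interaction module concrete, the following is a minimal sketch of a dot-product attention block with a residual connection, in the spirit of the one described above. The function name `interaction_block` and the shapes are illustrative assumptions, not the thesis implementation; the real model operates on features from a deep video network and a detector.

```python
import numpy as np

def interaction_block(person_feats, context_feats):
    """Dot-product attention sketch: each person feature attends to
    context features (other people, detected objects, or features from
    other time steps), and the attended context is added back as a
    residual, strengthening the action representation."""
    d = person_feats.shape[-1]
    # Scaled dot-product scores between every person and every context entity.
    scores = person_feats @ context_feats.T / np.sqrt(d)
    # Row-wise softmax (numerically stable form).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Residual connection preserves the original person feature.
    return person_feats + weights @ context_feats

# Illustrative shapes: 2 person boxes, 5 context entities, 8-dim features.
rng = np.random.default_rng(0)
people = rng.standard_normal((2, 8))
objects = rng.standard_normal((5, 8))
out = interaction_block(people, objects)
print(out.shape)  # (2, 8)
```

Because the block keeps the input and output shapes identical, several such blocks (person-person, person-object, temporal) can be chained in series, which is exactly what the serial reasoning structure above requires.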
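The feature pool idea can likewise be sketched as a small data structure. The class name `FeaturePool`, the `window` parameter, and the timestamp keying are assumptions made for illustration; the point is only that long-range features are read from storage instead of being recomputed by convolving over the whole video.

```python
import numpy as np

class FeaturePool:
    """Sketch of a long-term feature pool with dynamic read-write.

    Per-clip person features are stored keyed by timestamp. `write`
    dynamically updates the entry for the current clip during training;
    `read` gathers stored features within a temporal window, providing
    long-range context without re-running the video backbone."""

    def __init__(self, window=30):
        self.window = window   # temporal radius, in clips (assumed unit)
        self.pool = {}         # timestamp -> feature array

    def write(self, t, feats):
        # Overwrite with the freshest features for clip t.
        self.pool[t] = feats

    def read(self, t):
        # Collect neighbors of clip t within the window (excluding t itself).
        keys = sorted(k for k in self.pool
                      if k != t and abs(k - t) <= self.window)
        if not keys:
            return None
        return np.concatenate([self.pool[k] for k in keys], axis=0)

# Usage: store features for 5 clips, then read long-range context for clip 2.
pool = FeaturePool(window=2)
for t in range(5):
    pool.write(t, np.full((1, 4), float(t)))
context = pool.read(2)
print(context.shape)  # (4, 4): clips 0, 1, 3, 4
```

Reading from the pool is a dictionary lookup plus a concatenation, which is why this scheme avoids the heavy computation of applying convolutions directly across a long video.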
Keywords/Search Tags:Video understanding, action recognition, deep learning, interactions, dot-product attention