Behavior Recognition Methods In Complex Scenarios | Posted on: 2023-06-16 | Degree: Doctor | Type: Dissertation | Country: China | Candidate: R Yan | Full Text: PDF | GTID: 1528307331471804 | Subject: Computer Science and Technology

Abstract/Summary:

With the continuous improvement of storage technology and Internet infrastructure, video data has become a primary object and application scenario of intelligent-algorithm research. This dissertation focuses on a fundamental topic in video content understanding: human action recognition. The task is essentially a classification problem, i.e., recognizing the behaviors performed by people in a given video. Previous research on action recognition mainly addressed simple scenes (such as single-person gestures, single-person actions, and two-person interactions), whereas this dissertation targets actions in complex scenes, namely group activity (multi-person interaction) and compositional activity (human-object interaction). Action recognition in complex scenes requires algorithms to understand not only single-person gestures, single-person actions, and two-person interactions, but also the potential interactions among multiple persons or objects. Such technology plays an indispensable role in applications like intelligent surveillance, intelligent entertainment, and intelligent retail.

This dissertation studies action recognition in complex scenarios through two specific problems: group activity recognition and compositional activity recognition. The main challenges of group activity recognition are two-fold: i) the contextual relationships between persons are complex, and ii) fine-grained supervision is costly to obtain. The main challenge of compositional activity recognition is that existing deep models tend to learn inductive biases from the training data, which leads to poor generalization on samples with unseen "action-object" pairs. Focusing on these issues, this dissertation conducts the following research:

· We propose the Participation-Contributed Temporal Dynamic Model for group activity recognition. The model filters key actors out of multi-person scenes and aggregates their action features into a more discriminative representation. Specifically, to preserve Long Motions as much as possible, it fuses person features in descending order of individual movement intensity. Based on the similarity between person features and their spatial locations, contextual interactions within the group are constructed to mine interaction-relevant Flash Motions, and during person-feature fusion, learnable weights select the Flash Motions that are semantically related to the group activity.

· We propose a Hierarchical Cross Inference Network for group activity recognition. To fully exploit the potential spatiotemporal dependencies among multi-level information in the scene (such as body regions, persons, and the group activity), this work first designs a generic Cross Inference Block that simultaneously captures i) the spatial dependencies between feature nodes (e.g., between parts of a human body) and ii) the temporal dependencies of each feature node (e.g., the evolution of an individual action over time); a sketch of this cross inference idea is given after this list. The block is applied to capture spatiotemporal dependencies between body regions or between persons. The method requires no individual action labels yet still achieves good performance on popular benchmark datasets, which makes it easier to apply in crowded real-time scenes where individual action labels cannot be provided.
· We propose an Adaptive Interaction Module for group activity recognition. We first introduce a novel weak-annotation setting (i.e., only video-level labels) for this task and, under the new setting, collect a larger and more challenging dataset at extremely low cost. To alleviate the inaccurate supervision brought by weak annotation, the Adaptive Interaction Module automatically mines discriminative persons and video frames, based on the assumption that "key instances are often closely related to each other"; a sketch of this key-instance mining appears after this list.

· We propose Progressive Instance-aware Feature Learning for compositional activity recognition. The framework progressively injects instance information (position and identity) into human action feature extraction at different stages: 1) Position-aware Appearance Feature Extraction, which uses instance positions to extract instance-centric appearance features from videos; 2) Identity-aware Feature Interaction, which performs differentiated context modeling across instance-level features using identity information; and 3) Semantic-aware Position Prediction, which predicts the future positions of instances from semantic features to strengthen the model's perception of instance motion.

· We propose a video-language joint understanding method for compositional activity recognition. We first formulate the task and introduce a more practical data split and a more reasonable metric. On this basis, we propose a novel framework, "Look Less Think More": i) in the visual representation space, instance-centric video mutation constructs counterexamples that break the potential inductive bias between object appearance and action semantics; ii) in the language representation space, it mines commonsense associations between object tags and human action labels through a contrastive learning mechanism.
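To make the cross inference idea concrete, the following is a minimal PyTorch sketch under stated assumptions, not the dissertation's actual implementation: the (B, T, N, D) tensor layout, the shared softmax-affinity message passing, and the residual fusion are all illustrative choices. Given features for N nodes (body regions or persons) over T frames, the block mixes information spatially across nodes within each frame and temporally across frames within each node, mirroring points i) and ii) above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossInferenceBlock(nn.Module):
        # Hypothetical sketch: reasons over a (B, T, N, D) feature tensor,
        # i.e., B videos, T frames, N nodes (body regions or persons), D channels.
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)    # query projection for affinities
            self.k = nn.Linear(dim, dim)    # key projection for affinities
            self.v = nn.Linear(dim, dim)    # value projection for messages
            self.out = nn.Linear(dim, dim)  # fusion of the two dependency types

        def _propagate(self, x):
            # Softmax affinity between entries of the second-to-last axis,
            # followed by a weighted sum of their value vectors.
            attn = torch.softmax(
                self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5,
                dim=-1)
            return attn @ self.v(x)

        def forward(self, x):
            spatial = self._propagate(x)                   # mixes the N axis per frame
            temporal = self._propagate(x.transpose(1, 2))  # mixes the T axis per node
            temporal = temporal.transpose(1, 2)
            return x + self.out(F.relu(spatial + temporal))  # residual fusion

    # Toy usage: 2 videos, 8 frames, 12 persons, 64-dimensional features.
    block = CrossInferenceBlock(64)
    print(block(torch.randn(2, 8, 12, 64)).shape)  # torch.Size([2, 8, 12, 64])

Stacking such blocks over body-region nodes first and person nodes afterwards would give the hierarchical flavor described above, though the actual network design in the dissertation may differ.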
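The key-instance assumption of the Adaptive Interaction Module can likewise be sketched in a few lines. This is an illustrative reading only: the cosine-similarity scoring, the softmax weighting, and the helper name mine_key_instances are hypothetical, not the published module. Each person (or frame) is scored by its mean similarity to all other instances, so instances that are closely related to the rest of the group dominate the fused video-level feature.

    import torch
    import torch.nn.functional as F

    def mine_key_instances(feats):
        # Hypothetical sketch of "key instances are often closely related to
        # each other": score each instance by its mean cosine similarity to
        # the others, then convert the scores into fusion weights.
        # feats: (N, D) person (or frame) features from one video.
        normed = F.normalize(feats, dim=-1)
        sim = normed @ normed.T            # (N, N) pairwise cosine similarity
        sim.fill_diagonal_(0.0)            # ignore self-similarity
        score = sim.mean(dim=-1)           # high = consistent with the group
        weight = torch.softmax(score, dim=-1)
        return weight @ feats, weight      # fused video feature and weights

    video_feat, w = mine_key_instances(torch.randn(12, 64))  # 12 persons, 64-d
    print(video_feat.shape, w.shape)  # torch.Size([64]) torch.Size([12])

Under the weak-annotation setting, only a video-level classification loss on video_feat would flow back through these weights, which is why no individual action labels are needed.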
Keywords/Search Tags: Human action recognition, Video understanding, Group activity, Human-object interaction, Spatio-temporal reasoning, Graph reasoning, Compositional generalization, Multimodal understanding