
Research on Activity Understanding Based on Dynamic Representation Learning

Posted on: 2023-05-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Lin    Full Text: PDF
GTID: 1528306848957729    Subject: Computer Science and Technology
Abstract/Summary:
Activity understanding is one of the core problems in computer vision and artificial intelligence, with important theoretical significance and potential application value. On one hand, activity understanding is challenging: it not only relates to low-level video representations but also provides a solid foundation for high-level semantic understanding and visual question answering. On the other hand, with the rapid development of multimedia and the continuous growth of social demands, activity understanding plays an important role in real-world applications such as autonomous driving, robotic manipulation, visual surveillance and entertainment. In this paper, we focus on activity understanding based on motion representation learning, from the pixel level to the semantic level. For the pixel-level task, we study a video prediction approach based on motion-aware feature enhancement. For high-level semantic tasks, we further explore human-object interaction (HOI) detection. To address the subtle differences between fine-grained actions and the domination of non-interactive pairs, we study an action-guided attention mining and relation reasoning method for HOI detection. Moreover, to improve the poor detection performance on rare categories, we explore learning motion-relevant knowledge from unlabeled videos to assist HOI detection. Furthermore, to remove the dependence on extra data, we study mining the ground-truth labels themselves for HOI detection. The main contributions of this paper are summarized as follows:

(1) Most video prediction approaches rely on pixel-level reconstruction objectives and two-stream structures, which still miss motion details and therefore produce blurry generations or degrade dramatically in long-term prediction. We propose a Motion-Aware Feature Enhancement (MAFE) network for video prediction that produces sharp future frames and achieves relatively long-term predictions. First, a Channel-wise and Spatial Attention module is designed to extract motion-aware features; it enhances the contribution of important motion details during encoding and improves the discriminability of the attention maps used for frame refinement. Second, a Motion Perceptual Loss is proposed to guide the learning of temporal cues, which benefits robust long-term video prediction. Extensive experiments on three human activity video datasets (KTH, Human3.6M and Penn Action) demonstrate that MAFE outperforms current video prediction approaches and alleviates blurry predictions.

(2) Human-Object Interaction (HOI) detection is challenging due to the subtle differences between fine-grained actions and the presence of multiple co-occurring interactions. Most approaches tackle these problems by considering multi-stream information and even introducing extra knowledge, yet they suffer from a huge combination space and the non-interactive pair domination problem. We propose an Action-Guided attention mining and Relation Reasoning (AGRR) network to solve these problems. Relation reasoning on human-object pairs exploits contextual compatibility consistency among pairs to filter out non-interactive combinations. To better discriminate the subtle differences between fine-grained actions, an action-aware attention based on class activation maps is proposed to mine the features most relevant for recognizing HOIs. Extensive experiments on the V-COCO and HICO-DET datasets demonstrate that AGRR achieves better performance without any extra knowledge, even outperforming methods that rely on human pose or word embeddings.
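As a rough illustration of the Channel-wise and Spatial Attention idea in contribution (1), the following PyTorch sketch shows a CBAM-style block that re-weights encoder features along channels and then over spatial positions. The layer sizes, reduction ratio and all names are assumptions made for illustration, not the MAFE implementation.

```python
# A minimal CBAM-style sketch of a channel-wise and spatial attention block.
# Module names and sizes are illustrative assumptions, not the dissertation's code.
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """Re-weights encoder features along channels, then over spatial positions."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # ---- channel attention ----
        avg = x.mean(dim=(2, 3))                      # (B, C)
        mx = x.amax(dim=(2, 3))                       # (B, C)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)
        # ---- spatial attention ----
        avg_map = x.mean(dim=1, keepdim=True)         # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)         # (B, 1, H, W)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * sa


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)                # dummy encoder features
    print(ChannelSpatialAttention(64)(feats).shape)   # torch.Size([2, 64, 32, 32])
```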
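The action-aware attention in contribution (2) builds on class activation maps (CAMs); the sketch below shows one standard way a CAM can be turned into an attention map over human-object pair features. The module layout, the top-action weighting and every name here are illustrative assumptions rather than the AGRR design.

```python
# A rough sketch of class-activation-map (CAM) style action-aware attention.
# The network layout and weighting scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class CAMActionAttention(nn.Module):
    """Builds a per-action activation map from a linear classifier's weights and
    uses it to highlight action-relevant regions of the pair feature map."""

    def __init__(self, channels: int, num_actions: int):
        super().__init__()
        self.classifier = nn.Linear(channels, num_actions, bias=False)

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W) appearance features of a human-object pair region.
        b, c, h, w = feats.shape
        pooled = feats.mean(dim=(2, 3))                     # global average pooling
        logits = self.classifier(pooled)                    # (B, A) action scores
        # CAM: weight feature channels by the classifier weights of the top action.
        top_action = logits.argmax(dim=1)                   # (B,)
        w_a = self.classifier.weight[top_action]            # (B, C)
        cam = torch.einsum("bc,bchw->bhw", w_a, feats)      # (B, H, W)
        cam = torch.sigmoid(cam).unsqueeze(1)               # attention map in [0, 1]
        attended = feats * cam                              # mine relevant features
        return attended, logits


if __name__ == "__main__":
    module = CAMActionAttention(channels=256, num_actions=29)  # action count is illustrative
    x = torch.randn(4, 256, 7, 7)
    out, scores = module(x)
    print(out.shape, scores.shape)  # torch.Size([4, 256, 7, 7]) torch.Size([4, 29])
```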
(3) Current works on HOI detection usually rely on expensive large-scale labeled image datasets. In real scenes, however, labeled data may be insufficient and some rare HOI categories have few samples, which poses great challenges for deep-learning-based HOI detectors. Existing works address this by introducing compositional learning or word embeddings, but still depend on well-learned external knowledge. In contrast, unlabeled videos contain rich motion-relevant information that can help infer rare HOIs. We therefore propose, from a multi-task learning perspective, to assist HOI detection with motion-relevant knowledge learned from unlabeled videos. Specifically, we design an appearance reconstruction loss and a sequential motion mining module, trained in a self-supervised manner, to learn more generalizable motion representations that promote the detection of rare HOIs. Moreover, to better transfer motion-related knowledge from unlabeled videos to HOI images, a domain discriminator is introduced to reduce the gap between the two domains. Extensive experiments on the HICO-DET dataset with rare categories and the V-COCO dataset with minimal supervision demonstrate the effectiveness of the motion-aware knowledge implied in unlabeled videos for HOI detection, especially for rare HOI categories.

(4) Existing works on HOI detection usually introduce spatial context, extra knowledge or graph-based propagation on top of the original hard labels. However, they still struggle with action co-occurrence and complex HOIs. To address these challenges without relying on additional data, we propose to mine the ground-truth annotations for the implicit information needed to build structured representations of HOIs. An Action-aware Closeness Labeling task is designed to capture scene context based on action co-occurrence statistics from the data source. Furthermore, we present a human-object Relation Graph Supervision that obtains more reliable relations in complicated scenes by constraining the attention weights of the human-object relation graph; such direct supervision on mutual relations is ignored in existing works. Extensive experiments on the V-COCO and HICO-DET datasets demonstrate the superiority of the proposed method over other HOI detection approaches. Besides, the method does not rely on any extra data and can handle HOIs in complicated scenes.
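For the domain discriminator in contribution (3), a common realization is adversarial alignment through a gradient reversal layer; the sketch below illustrates that generic scheme under assumed names and dimensions, and is not claimed to be the exact design used in the dissertation.

```python
# A generic sketch of a domain discriminator with a gradient reversal layer (GRL).
# Class and variable names are assumptions; only the standard adversarial idea is shown.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient so the backbone is pushed toward domain-invariant features.
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """Predicts whether a feature comes from the video or the image domain."""

    def __init__(self, dim: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(inplace=True),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        reversed_feats = GradReverse.apply(feats, self.lambd)
        return self.net(reversed_feats)                     # domain logits


if __name__ == "__main__":
    disc = DomainDiscriminator(dim=512)
    video_feats = torch.randn(8, 512, requires_grad=True)   # unlabeled-video features
    image_feats = torch.randn(8, 512, requires_grad=True)   # HOI-image features
    logits = disc(torch.cat([video_feats, image_feats]))
    labels = torch.cat([torch.zeros(8, 1), torch.ones(8, 1)])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()  # reversed gradients reduce the gap between the two domains
```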
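For the human-object Relation Graph Supervision in contribution (4), the sketch below shows one plausible way to constrain graph attention weights with the annotated interactive pairs. The dot-product attention and the binary cross-entropy loss form are assumptions for illustration, not the dissertation's exact formulation.

```python
# A minimal sketch of supervising relation-graph attention weights with the
# ground-truth interactive pairs. Loss form and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationGraphAttention(nn.Module):
    """Computes pairwise attention weights between human and object nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, human_feats: torch.Tensor, object_feats: torch.Tensor):
        # human_feats: (H, D), object_feats: (O, D)
        scores = self.query(human_feats) @ self.key(object_feats).t()
        attn = torch.sigmoid(scores / human_feats.shape[-1] ** 0.5)   # (H, O) in [0, 1]
        return attn


def relation_graph_supervision(attn: torch.Tensor, gt_pairs: torch.Tensor) -> torch.Tensor:
    """gt_pairs[h, o] = 1 if human h and object o truly interact, else 0."""
    return F.binary_cross_entropy(attn, gt_pairs)


if __name__ == "__main__":
    graph = RelationGraphAttention(dim=256)
    humans, objects = torch.randn(3, 256), torch.randn(5, 256)
    attn = graph(humans, objects)
    gt = torch.zeros(3, 5)
    gt[0, 2] = gt[1, 4] = 1.0                 # annotated interactive pairs
    loss = relation_graph_supervision(attn, gt)
    print(attn.shape, loss.item())
```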
Keywords/Search Tags:Activity understanding, Video prediction, Human-object interaction detection, Self-supervised learning, Multi-task learning, Graph-based reasoning, Attention mechanism