Interactive action detection, which aims to identify actions involving human-human and human-object interactions, plays an important role in many fields, e.g., human-computer interaction, autonomous driving, and video surveillance. In recent years, methods based on deep learning have substantially improved the performance of interactive action detection. However, existing approaches still suffer from various limitations that make them unsatisfactory in practical applications. In this dissertation, we focus on two tasks related to interactive action detection, i.e., human-object interaction detection and video action detection. These two tasks address two different data modalities, i.e., images and videos, both of which are common in realistic application scenarios. Specifically, human-object interaction detection aims to understand how humans interact with objects in images, while video action detection recognizes human-centric actions in videos, where each person typically interacts with other persons or objects. The main contributions are summarized as follows:

(1) We propose a novel method named VSILNet for human-object interaction detection, which combines local relations with contextual information. First, considering that the visual differences between fine-grained interactions are often too subtle to detect, we introduce a pose-guided local relation branch (LRB). LRB first extracts representations for the human keypoints and then learns the semantic relations between different body parts and objects, thereby capturing fine-grained cues related to interactions. Second, to handle practical scenarios involving multiple persons and objects, we devise an instance relation branch (IRB). IRB aggregates contextual information from the scene by modeling the relationships between instances. More importantly, we develop different graph structures to learn the interactions between instances of different types and the correlations between instances of the same type, respectively. Extensive quantitative and qualitative analyses on V-COCO and HICO-DET justify the superiority of the proposed method.

(2) We propose a novel method named SLCNet for video action detection, which simultaneously models interactions and class dependencies. First, motivated by the fact that action durations in videos vary widely, we propose a short-term interaction module (STIM) and a long-term interaction module (LTIM) to model short-term spatio-temporal interactions and long-term temporal dependencies between actors, respectively. In particular, based on the heterogeneity of the spatial and temporal dimensions, STIM adopts a decoupling mechanism that handles spatial interactions and short-term temporal interactions separately. Second, to address the multi-label classification problem posed by the dataset, we propose a class relation module (CRM), which mines the semantic dependencies between different action classes through a self-attention mechanism. These class-level dependencies are used to aggregate information from related categories, thereby enhancing the discriminative ability of the original category representations. Extensive experiments on AVA v2.1 demonstrate the effectiveness of our method.

In summary, we propose effective solutions for human-object interaction detection and video action detection, which significantly improve the detection accuracy of interactive actions in both images and videos, making our research valuable for real-world applications.
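As a rough illustration of the relational modeling described for IRB (the same attention pattern applies at the body-part level in LRB), the sketch below uses attention-based message passing with one graph for inter-type relations (human-object) and one for intra-type correlations (human-human, object-object). All class and variable names here are hypothetical; the abstract does not specify the exact architecture.

```python
import torch
import torch.nn as nn

class RelationGraph(nn.Module):
    """Attention-based message passing: destination nodes aggregate
    messages from source nodes (hypothetical building block)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, src, dst):
        # Affinity between each destination and all source instances.
        scores = self.query(dst) @ self.key(src).transpose(-2, -1)
        attn = torch.softmax(scores / src.shape[-1] ** 0.5, dim=-1)
        # Residual update with the aggregated messages.
        return dst + attn @ self.value(src)


class InstanceRelationBranch(nn.Module):
    """Sketch of IRB under assumed shapes: one graph for inter-type
    relations (human-object), one for intra-type correlations
    (human-human, object-object)."""
    def __init__(self, dim):
        super().__init__()
        self.inter = RelationGraph(dim)
        self.intra = RelationGraph(dim)

    def forward(self, humans, objects):
        # humans: (N_h, dim) instance features; objects: (N_o, dim)
        h = self.inter(objects, humans)  # humans gather object context
        o = self.inter(humans, objects)  # objects gather human context
        h = self.intra(h, h)             # correlations among humans
        o = self.intra(o, o)             # correlations among objects
        return h, o
```

For example, `InstanceRelationBranch(256)(torch.randn(3, 256), torch.randn(5, 256))` would jointly refine three human and five object features.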
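The decoupling mechanism in STIM can be pictured as two separate attention passes, one over the actors within each frame (spatial interaction) and one over the frames of a short clip for each actor (short-term temporal interaction), rather than a single joint spatio-temporal operation. This is a minimal sketch under assumed tensor shapes and hyperparameters, not the dissertation's exact design.

```python
import torch
import torch.nn as nn

class ShortTermInteractionModule(nn.Module):
    """Sketch of STIM's decoupling: separate attention for the spatial
    and the short-term temporal dimension (names and head counts are
    assumptions)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (T, N, dim) -- N actor features over a short clip of T frames.
        # Spatial interaction: actors in the same frame attend to each
        # other (frames act as the batch dimension).
        s, _ = self.spatial(x, x, x)
        # Short-term temporal interaction: each actor attends to its own
        # features across frames (actors act as the batch dimension).
        t = s.transpose(0, 1)            # (N, T, dim)
        t, _ = self.temporal(t, t, t)
        return t.transpose(0, 1)         # back to (T, N, dim)
```

Treating space and time with separate, cheaper attention passes reflects the heterogeneity argument in the abstract: the two dimensions carry different kinds of structure, so a single joint operation is not forced to model both at once.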
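CRM's use of self-attention to mine class dependencies might look like the following sketch, where each per-class representation aggregates information from semantically related classes and a residual connection preserves the original category representation. Shapes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ClassRelationModule(nn.Module):
    """Sketch of CRM: self-attention over per-class representations so
    each class is enriched by related classes (hypothetical layout)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, class_feats):
        # class_feats: (B, num_classes, dim), one representation per class.
        related, _ = self.attn(class_feats, class_feats, class_feats)
        # Residual connection keeps the original category representation
        # while adding information aggregated from related categories.
        return self.norm(class_feats + related)
```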