Research On Human-Object Interaction Detection Based On Deep Learning

Posted on: 2024-02-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y M Cheng
GTID: 1528307079951459
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of mobile networks and the explosive growth of multimedia data, computer vision has become a central arena for the new generation of artificial intelligence. Computer vision aims to enable computers to understand the content of images and videos. Basic visual cognition tasks, such as image classification and object detection, now achieve excellent performance, whereas high-level semantic understanding of images and videos remains relatively weak. To understand the high-level semantic information in images more accurately, this dissertation builds on basic visual cognition tasks and studies the relationships between detected instances, such as persons and objects. Accurately understanding these relationships supports higher-level visual understanding tasks such as image captioning. Specifically, this dissertation focuses on human-object interaction (HOI) detection, which studies the interactions between instances (humans and objects) in images and expresses them as triplets (human, interaction relationship, object), where the position and category of both the human and the object instance must be determined. Neural network models already perform very well in image recognition and object detection, which demonstrates their ability to extract instance information from images. The challenge for HOI detection is therefore how to make full use of the features extracted by neural networks so that the model can accurately infer interaction relationships in varied scenarios. To this end, this dissertation investigates three aspects: depth information deficiency, instance scale differences, and model generalization in under-annotated environments. The main contributions are summarized as follows:

(1) To deal with depth information deficiency, this dissertation proposes incorporating depth information into visual features to compensate for the missing third dimension between instances. We design the DRR model, a depth-enhanced relationship reasoning method that allows neural networks to learn from both RGB images and depth information. First, a pre-trained depth estimation model generates the corresponding depth map from a given RGB image. Then, in the encoding stage, four branches extract multiple semantic features. In the decoding stage, these features are integrated by a hierarchical attention mechanism to generate depth-enhanced features for the accurate classification of instance relationships. With the help of depth information, the proposed DRR achieves superior performance on two HOI datasets compared to previous state-of-the-art methods.

(2) This dissertation proposes a multi-scale fusion architecture to address differences in instance scale. The approach uses a multi-scale patch representation and a multi-path structure to reduce the impact of differing instance scales on HOI detection. In addition, anchor points are introduced as reference points for aggregating multi-scale semantic features while accounting for the variability of object positions. Finally, auxiliary techniques are employed to further improve HOI detection performance, including an auxiliary decoding loss, iterative box refinement, and progressive class calibration. Extensive evaluation demonstrates that the proposed method outperforms previous state-of-the-art methods.

(3) To tackle model generalization in scenarios with insufficient annotation, this dissertation introduces unsupervised domain adaptation (UDA) to the HOI detection task and proposes a novel HOI detector with unsupervised domain adaptation (HOI-U), which trains an adaptive prediction model that generalizes well to target domains with scarce labels. Two domain discrepancy reduction schemes are introduced, which learn domain-invariant features by aligning features at both the map and sequence levels. In addition, to separate inter-class interactions and tightly cluster intra-class interactions, this dissertation proposes a memory bank that minimizes the feature differences related to interaction relationships. Extensive evaluation shows that the proposed method brings significant improvements to different models, which demonstrates its generality.

Finally, this dissertation summarizes the above research and outlines future research directions that may have significant implications for the development of human-object interaction detection.
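
As a purely illustrative aid, the triplet output format described in the abstract can be sketched as a small Python data structure. This is not code from the dissertation: the class name `HOITriplet`, the `(x_min, y_min, x_max, y_max)` box convention, the `score` field, and the example values are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple

# Bounding box as (x_min, y_min, x_max, y_max) in pixels.
# This convention is an assumption; the abstract does not specify one.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HOITriplet:
    """One detected (human, interaction relationship, object) triplet.

    Hypothetical container for illustration only.
    """

    human_box: Box          # position of the human instance
    interaction: str        # interaction relationship, e.g. "ride"
    object_box: Box         # position of the object instance
    object_category: str    # category of the object instance
    score: float = 1.0      # hypothetical detection confidence

    def describe(self) -> str:
        # Render the triplet in the (human, interaction, object) form.
        return f"(human, {self.interaction}, {self.object_category})"


# Made-up example values, not taken from any dataset.
example = HOITriplet(
    human_box=(48.0, 30.0, 210.0, 400.0),
    interaction="ride",
    object_box=(60.0, 220.0, 330.0, 460.0),
    object_category="bicycle",
    score=0.87,
)
print(example.describe())
```

An HOI detector in this sense would return a list of such triplets per image, one for each human-object pair it judges to be interacting.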
Keywords/Search Tags: Human-Object Interaction (HOI) Detection, Depth Estimation, Multi-scale, Unsupervised Domain Adaptation