
Hand Pose Estimation Based On Depth Vision

Posted on: 2024-04-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P F Ren
GTID: 1528306944470294
Subject: Computer Science and Technology

Abstract/Summary:
The hand is one of the most primitive and natural means of human interaction and an important medium through which humans convey intentions. Hand pose estimation aims to predict the 3D coordinates of the hand joints and the 3D hand model from the input hand visual information. In today's information age, this technology has a wide range of application scenarios, such as virtual reality, augmented reality, smart classrooms, and smart cockpits. In recent years, with the development of neural networks and the popularization of depth sensors, depth-based hand pose estimation has made great progress. However, some unresolved technical difficulties still stand in the way of accurate and robust hand pose estimation. On the one hand, existing methods struggle with hand self-occlusion, finger self-similarity, image noise, and extreme viewpoints, and cannot fully mine the 3D spatial information of depth data, so they are prone to predicting inaccurate and implausible hand poses. On the other hand, existing methods rely heavily on large-scale labeled data, yet obtaining high-quality 3D hand pose annotations is time-consuming and labor-intensive, which severely restricts the development of hand pose estimation algorithms. This thesis conducts in-depth research on these problems and proposes corresponding innovative solutions. The main contributions of the thesis are as follows:

(1) To address the high degrees of freedom of hand poses and the resulting difficulty of network optimization, this thesis improves the two basic paradigms of hand pose estimation. For the holistic regression paradigm, it proposes a stacked regression network based on iterative refinement: a hand pose re-parameterization technique converts predicted 3D hand poses into pixel-wise visual representations, which provide strong disambiguation cues for multi-stage pose refinement. For the element-wise estimation paradigm, it proposes a differentiable adaptive weighted aggregation
mechanism, which obtains the 3D hand pose directly from the element-wise representation, discards the complicated post-processing step, and enables end-to-end network optimization.

(2) Limited by their local receptive fields, convolutional neural networks find it difficult to model long-range visual dependencies and therefore handle hand self-occlusion and finger self-similarity poorly. This thesis proposes a pose-guided dynamic visual feature enhancement mechanism, which uses graph convolution to model the long-range dependencies among different hand regions and dynamically enhances visual features with the captured global pose information. This approach significantly improves the accuracy and robustness of 3D hand pose estimation.

(3) To fully mine both the local visual information and the 3D geometric structure of depth data, this thesis proposes an image-point cloud hybrid network. The network integrates the advantages of depth images and 3D point clouds: it uses convolutional neural networks to efficiently extract visual features from depth images, and captures the geometric structure of the depth data through a point cloud feature update mechanism based on sparse anchors. While maintaining real-time inference speed, the model significantly improves accuracy and achieved the best hand pose estimation accuracy of its time.

(4) To address the difficulty of obtaining high-quality hand annotations, this thesis proposes self-supervised learning methods for both multi-view and single-view scenarios. Multi-view information can alleviate the estimation ambiguity caused by hand self-occlusion and depth holes; this thesis therefore proposes a graph-convolution-based multi-view adaptive feature fusion method, which deeply mines the semantic information of each view and the dependencies between views. However, the need for multi-view data limits the usage scenarios of self-supervised algorithms. This thesis further proposes a
dual-branch self-boosting self-supervised framework suitable for single-view scenarios, which enables self-supervised training by maintaining consistency between two heterogeneous branches. The proposed method outperforms existing self-supervised methods by a large margin, surpasses some supervised methods, and shows strong generalization ability.
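The differentiable aggregation idea in contribution (1) can be illustrated with a soft-argmax-style sketch: per-pixel responses are collapsed into joint coordinates through a softmax-weighted average, so the whole mapping stays differentiable. This is a common realization of the idea, not the exact mechanism from the thesis; all names and shapes below are illustrative.

```python
import numpy as np

def soft_argmax_3d(heatmaps, depth_maps, beta=10.0):
    """Collapse per-pixel joint estimates into 3D coordinates via a
    softmax-weighted average (a soft-argmax sketch, not the thesis method).

    heatmaps:   (J, H, W) per-joint response maps
    depth_maps: (J, H, W) per-pixel depth estimates for each joint
    Returns:    (J, 3) estimated (u, v, d) joint coordinates
    """
    J, H, W = heatmaps.shape
    # Softmax over spatial locations -> adaptive per-pixel weights
    flat = heatmaps.reshape(J, -1) * beta
    w = np.exp(flat - flat.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # (J, H*W)

    # Pixel coordinate grids: ys[i, j] = i (row), xs[i, j] = j (column)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    u = (w * xs.reshape(1, -1)).sum(axis=1)           # weighted x coordinate
    v = (w * ys.reshape(1, -1)).sum(axis=1)           # weighted y coordinate
    d = (w * depth_maps.reshape(J, -1)).sum(axis=1)   # weighted depth
    return np.stack([u, v, d], axis=1)
```

Because every step is a weighted sum, gradients flow back to the element-wise maps, which is what lets such a network train end to end without a non-differentiable post-processing stage.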
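The single-view self-supervised idea in contribution (4) rests on keeping two heterogeneous branches consistent on unlabeled frames. A minimal sketch of such a mutual-consistency objective follows; the actual branches and losses in the thesis are more elaborate, and the symmetric stop-gradient scheme here is only one plausible instantiation.

```python
import numpy as np

def dual_branch_consistency(pred_a, pred_b):
    """Mutual-consistency objective between two heterogeneous branches,
    each predicting (J, 3) joint coordinates from the same unlabeled frame.
    Each branch is pulled toward a detached copy of the other's prediction,
    so the branches bootstrap each other without 3D labels
    (an illustrative self-boosting sketch, not the thesis loss).
    """
    target_a = pred_b.copy()   # stands in for stop-gradient(pred_b)
    target_b = pred_a.copy()   # stands in for stop-gradient(pred_a)
    loss_a = np.mean((pred_a - target_a) ** 2)
    loss_b = np.mean((pred_b - target_b) ** 2)
    return loss_a + loss_b
```

Minimizing this term on unlabeled data pushes the two branches to agree; the supervision signal comes entirely from their disagreement, which is the core of consistency-based self-supervision.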
Keywords/Search Tags: Hand Pose Estimation, Depth Image, 3D Point Cloud, Self-supervised Learning