
Hand Pose Estimation And Shape Reconstruction Based On Single RGB Images

Posted on: 2024-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Wang
Full Text: PDF
GTID: 2568306944450044
Subject: Electronic information
Abstract/Summary:
The hand is one of the most frequently used parts of the human body in daily life, and it plays a crucial role in human-computer interaction. 3D hand pose estimation and shape reconstruction are important research topics in human-computer interaction, with extensive applications in fields such as healthcare, virtual reality, and augmented reality. With the rapid advancement of artificial intelligence technology, hand pose estimation and shape reconstruction based on deep learning have achieved remarkable results. Because hand movements are highly flexible and edge devices have limited computational resources, a balance must be struck between speed and accuracy in hand pose estimation and shape reconstruction. With the widespread availability of consumer-grade RGB cameras, RGB-based hand pose estimation and shape reconstruction have received significant attention. In this paper, we explore how to fully leverage the features that neural networks extract from a single RGB image, and how to enhance network performance with minimal computational overhead.

This paper proposes two approaches that incorporate 2D information to improve the accuracy of 3D coordinate prediction, together with multiple modules that enhance network performance, achieving prediction accuracy that surpasses 2.5D coordinate representations. The first approach uses multi-task learning with two branches that separately predict 2D and 3D coordinates. The shared backbone feature extraction network implicitly incorporates the 2D information into the 3D features, yielding a significant improvement in prediction accuracy. Building on this, an analysis of the hand joint loss values in the network is conducted, and a Multi-Root loss function is proposed that improves network performance without increasing computational overhead or parameter count. The proposed Multi-Root loss function is generally applicable and can be used in other pose estimation tasks as well. To constrain the predicted 3D coordinates, this paper employs weak perspective projection to reproject the 3D coordinates into 2D coordinates. By supervising these reprojected 2D coordinates to produce more accurate 3D coordinates, and by refining the 3D coordinates through multi-stage refinement, the overall accuracy is improved. During inference, the multi-task design allows the 2D network branch to be pruned, and the proposed weak perspective projection module can also be removed, improving inference speed and reducing network parameters.

The second approach explicitly maps the features extracted by the 2D network branch into 2D prior information to enrich the 3D representation. Experimental results show that this explicit representation of 2D information yields better 3D coordinate predictions than the implicit representation. Previous hand pose and shape reconstruction approaches directly mapped the high-level semantic features output by the backbone network to 3D features, ignoring the features at different layers and resolutions of the backbone network. To address this issue, this paper proposes a lightweight multi-scale sampling module that fuses different features in the backbone network. By projecting the 2D coordinates predicted by the 2D network branch onto the features output by the multi-scale sampling module, pixel-level multi-scale information is extracted. The 3D feature representation of the hand mesh vertices is then enriched by integrating the 2D prior information, the multi-scale information, and the high-level semantic information extracted from the backbone network. Adding multiple loss functions, weak perspective projection, and a multi-stage refinement module further improves the performance of the network.

The two methods proposed in this paper have 109M and 115M parameters, and their inference speeds on a 2080 Ti are 107 FPS and 88 FPS, respectively. These methods have the fewest parameters and fastest inference speeds among advanced hand pose estimation and reconstruction methods, making them more suitable for practical applications. Their 3D hand joint errors on the public FreiHAND hand dataset are 6.4 mm and 6.2 mm, respectively. Finally, the two proposed methods are augmented with a test-time augmentation (TTA) technique, in which the input image is processed in three different ways and the predicted results are averaged, yielding higher prediction accuracy.
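The test-time augmentation scheme described above can be sketched as follows. The abstract only states that the input is processed "in three different ways"; the specific transforms below (identity, horizontal flip, brightness shift) and all names are illustrative assumptions, not the thesis's actual configuration:

```python
import numpy as np

def predict_with_tta(model, image):
    """Test-time augmentation: run the model on several transformed
    copies of the input, map each prediction back into the original
    image frame, and average the aligned results."""
    h, w = image.shape[:2]
    preds = []

    # 1. Identity: predict on the original image.
    preds.append(model(image))

    # 2. Horizontal flip: flip the image, predict, then mirror the
    #    predicted x-coordinates back to the original frame.
    p = model(image[:, ::-1]).copy()
    p[:, 0] = (w - 1) - p[:, 0]
    preds.append(p)

    # 3. Brightness shift: a photometric change needs no coordinate
    #    correction on the way back.
    preds.append(model(np.clip(image * 1.1, 0.0, 1.0)))

    # Average the aligned predictions.
    return np.mean(preds, axis=0)

# Toy "network" for demonstration: predicts the (x, y) location of
# the brightest pixel.
def toy_model(img):
    y, x = np.unravel_index(np.argmax(img), img.shape)
    return np.array([[float(x), float(y)]])

img = np.zeros((32, 32))
img[5, 7] = 0.5
print(predict_with_tta(toy_model, img))  # -> [[7. 5.]]
```

Because the toy model is equivariant to all three transforms, the averaged prediction equals the single-pass prediction; for a real network the three estimates differ slightly and averaging reduces their variance.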
Keywords/Search Tags:hand pose estimation and shape reconstruction, RGB image, multi-task learning, multi-root loss, multi-feature fusion, multi-scale feature
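The weak perspective projection used in both methods to reproject predicted 3D joints into 2D can be sketched as below. The scale `s` and image-plane translation `t` would normally be regressed by the network; the function name and signature here are assumptions for illustration:

```python
import numpy as np

def weak_perspective_project(joints_3d, s, t):
    """Weak perspective projection: drop the per-joint depth and apply
    a single global scale s and a 2D translation t = (tx, ty).

    joints_3d : (J, 3) array of predicted 3D joint coordinates.
    Returns     (J, 2) array of reprojected 2D coordinates, which can
    be supervised against 2D ground-truth annotations.
    """
    return s * joints_3d[:, :2] + np.asarray(t, dtype=float)

joints = np.array([[0.1, -0.2, 0.5],
                   [0.3,  0.0, 0.4]])
print(weak_perspective_project(joints, s=100.0, t=(64.0, 64.0)))
# -> [[74. 44.]
#     [94. 64.]]
```

Because the projection is a fixed, parameter-free mapping given `s` and `t`, supervising its 2D output constrains the 3D branch during training, and the module can be dropped at inference time, as the abstract notes.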
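The step of projecting predicted 2D coordinates onto the multi-scale feature map to extract pixel-level features is commonly realized with bilinear sampling. The following is a generic sketch of that operation, not the thesis's actual sampling module, and assumes in-range coordinates:

```python
import numpy as np

def sample_features_at_2d(feature_map, coords):
    """Bilinearly sample a feature map at predicted 2D joint locations.

    feature_map : (H, W, C) array of backbone / multi-scale features.
    coords      : (J, 2) array of (x, y) pixel coordinates.
    Returns       (J, C) array of per-joint feature vectors.
    """
    h, w, _ = feature_map.shape
    out = []
    for x, y in coords:
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        dx, dy = x - x0, y - y0
        # Bilinear interpolation over the four neighbouring cells.
        f = (feature_map[y0, x0] * (1 - dx) * (1 - dy)
             + feature_map[y0, x1] * dx * (1 - dy)
             + feature_map[y1, x0] * (1 - dx) * dy
             + feature_map[y1, x1] * dx * dy)
        out.append(f)
    return np.stack(out)

# Synthetic check: channel 0 stores each cell's x index, channel 1 its
# y index, so sampling must reproduce the query coordinates exactly.
fm = np.zeros((4, 4, 2))
fm[:, :, 0] = np.arange(4)[None, :]
fm[:, :, 1] = np.arange(4)[:, None]
print(sample_features_at_2d(fm, np.array([[1.5, 2.25]])))  # -> [[1.5  2.25]]
```

The sampled per-joint vectors are what would then be concatenated with the 2D prior and high-level semantic features to enrich the 3D representation.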