Skeleton-aware Multi-scale Heatmap Regression for 2D Hand Pose Estimation
Hand pose estimation plays an essential role in sign language understanding and human-computer interaction. Existing RGB-based 2D hand pose estimation methods learn the joint locations at a single resolution, which does not adapt well to hands of different sizes. To tackle this problem, we propose a new deep learning-based framework that consists of two main modules. The first module uses a segmentation-based approach to detect the hand skeleton and localize the hand bounding box. The second module regresses the 2D joint locations through multi-scale heatmap regression, using the predicted hand skeleton as a constraint to guide the model. Moreover, we construct a new dataset suitable for both hand detection and pose estimation. It includes hand bounding boxes, 2D keypoints, 3D poses, and the corresponding RGB images. We conduct extensive experiments on two datasets to validate our method. Qualitative and quantitative results demonstrate that the proposed method outperforms state-of-the-art methods and recovers the pose even in cluttered images and for complex poses.
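To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows how a bounding box could be read off a binary segmentation mask, and how Gaussian heatmap targets for one joint could be rendered at several resolutions and decoded back to image coordinates. The function names, the 256-pixel input size, the scales (32, 64, 128), and the Gaussian width are all assumptions chosen for illustration; the abstract does not specify them, and the skeleton constraint itself is not modeled here.

```python
import numpy as np


def mask_to_bbox(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the non-zero pixels of a mask."""
    rows, cols = np.nonzero(mask)
    return cols.min(), rows.min(), cols.max(), rows.max()


def gaussian_heatmap(x, y, size, sigma=1.5):
    """Render a 2D Gaussian peak centred at (x, y) on a size x size grid."""
    cols, rows = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((cols - x) ** 2 + (rows - y) ** 2) / (2.0 * sigma ** 2))


def multiscale_targets(joint_xy, image_size=256, scales=(32, 64, 128)):
    """Build one heatmap target per scale for a joint given in image pixels.

    The scales and image size are hypothetical values, not taken from the paper.
    """
    x, y = joint_xy
    targets = {}
    for s in scales:
        factor = s / image_size  # map image coordinates to heatmap coordinates
        targets[s] = gaussian_heatmap(x * factor, y * factor, s)
    return targets


def decode(heatmap, image_size=256):
    """Recover the joint location in image pixels from the heatmap maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    factor = image_size / heatmap.shape[0]
    return col * factor, row * factor


if __name__ == "__main__":
    mask = np.zeros((10, 10), dtype=np.uint8)
    mask[2:5, 3:7] = 1
    print("bbox:", mask_to_bbox(mask))

    for scale, hm in multiscale_targets((128, 64)).items():
        print(scale, decode(hm))
```

Coarser heatmaps cover more context per cell but quantize the joint location more heavily, which is one plausible reason for supervising several resolutions rather than one when hand size varies.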
This work is licensed under a Creative Commons Attribution 3.0 License.