
Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation

Qi Ye*, Shanxin Yuan*, Tae-Kyun Kim
Department of Electrical and Electronic Engineering, Imperial College London
{q.ye14,s.yuan14,tk.kim}@imperial.ac.uk
Motivation
  • Existing hierarchical methods mainly focus on decomposing the output space, while the input space remains almost unchanged along the hierarchy.
  • A spatial attention mechanism is proposed to integrate cascaded and hierarchical regression into a CNN framework by transforming both the input (and feature) space and the output space, which greatly reduces viewpoint and articulation variations.
  • Between the levels of the hierarchy, the hierarchical PSO enforces kinematic constraints on the results of the CNNs.
Structure and Hand Model

  • The spatial attention mechanism integrates cascaded and hierarchical hand pose estimation into one framework.
  • The hand pose is estimated layer by layer in order of articulation complexity, with the spatial attention module transforming the input, feature, and output spaces.
  • Within each layer, the partial pose is iteratively refined in both viewpoint and location by the spatial attention module, which maps both the feature and output spaces to a canonical one.
  • After the refinement, the partial PSO is applied to select, from the results of the cascaded estimation, those that satisfy the hand kinematic constraints.


Annotations of the datasets

As the annotations of the datasets used in our paper do not conform to each other, we use the annotation version in [1].
  • Each line corresponds to one image.
  • Each line has 21x3 numbers, which indicate the (x, y, z) locations of 21 joints in the camera coordinate system, in meters.
  • The order of the joints in each line is wrist joint, thumb joints, index joints, middle joints, ring joints, pinky joints, i.e. 00,01,11,21,31,02,12,22,32,03,13,23,33,04,14,24,34,05,15,25,35 as demonstrated in the hand skeleton figure.
  • ICVL hand dataset
    Our annotation: ICVL_21jnt_test_ground_truth.csv, 
    ICVL_21jnt_train_ground_truth.csv
  • NYU hand dataset
    Our annotation: NYU_21jnt_test_ground_truth.csv, NYU_21jnt_train_ground_truth.csv
  • MSRC hand dataset
    Our annotation: 
    MSRC_21jnt_test_ground_truth.csv, MSRC_21jnt_train_ground_truth.csv
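The per-line layout described above can be parsed as follows. This is a hypothetical loader sketch (the function and joint names below are our own illustration, not part of the released files), assuming plain comma-separated values:

```python
import csv

# Joint naming is illustrative: wrist, then 4 joints per finger
# (thumb, index, middle, ring, pinky), 21 joints in total.
JOINT_NAMES = (
    ["wrist"]
    + [f"{finger}_{i}" for finger in ("thumb", "index", "middle", "ring", "pinky")
       for i in range(4)]
)

def load_annotations(path):
    """Return a list of poses; each pose is a list of 21 (x, y, z) tuples in meters."""
    poses = []
    with open(path) as f:
        for row in csv.reader(f):
            values = [float(v) for v in row if v.strip()]
            assert len(values) == 63, "expected 21 joints x 3 coordinates"
            # Group the flat 63-value row into 21 (x, y, z) triples.
            pose = [tuple(values[3 * j:3 * j + 3]) for j in range(21)]
            poses.append(pose)
    return poses
```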

Estimation Result
In our experiments, we do not apply any data augmentation such as rotation or translation, which many other papers use to improve performance. We use the original depth image, with the hand area cropped using the ground-truth joint locations. The hand area and the ground truth are normalized by the mean depth of the cropped hand pixels. Note that some methods cited in our paper normalize the cropped hand using the ground-truth location of a certain joint, which simplifies the estimation problem by eliminating the global translation.
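The preprocessing just described can be sketched roughly as below. The function name, padding parameter, and array shapes are illustrative assumptions, not the code used in the paper:

```python
import numpy as np

def crop_and_normalize(depth, joints_uvd, pad=20):
    """Crop the hand around the ground-truth joints, then normalize by mean depth.

    depth: (H, W) depth image; joints_uvd: (21, 3) array of (u, v, d) joints.
    Returns the normalized crop, normalized joints, and the mean depth used.
    """
    # Bounding box of the ground-truth joints, expanded by `pad` pixels.
    u_min = int(joints_uvd[:, 0].min()) - pad
    u_max = int(joints_uvd[:, 0].max()) + pad
    v_min = int(joints_uvd[:, 1].min()) - pad
    v_max = int(joints_uvd[:, 1].max()) + pad
    crop = depth[max(v_min, 0):v_max, max(u_min, 0):u_max].astype(np.float64)

    # Normalize by the mean of the valid (non-zero) cropped hand pixels.
    hand = crop[crop > 0]
    mean_depth = hand.mean()
    crop_norm = np.where(crop > 0, crop / mean_depth, 0.0)

    # Normalize the joint depths the same way, so labels match the input.
    joints_norm = joints_uvd.astype(np.float64).copy()
    joints_norm[:, 2] /= mean_depth
    return crop_norm, joints_norm, mean_depth
```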
  • Error of the estimation result for ICVL: ICVL_21jnt_hier_hybrid_sa_error.csv
    To compare with previous work mentioned in our paper, we use the joints 01,02,03,04,05,11,12,13,14,15,21,22,23,24,25.
  • Estimation joint locations for NYU: NYU_21jnt_hier_hybrid_sa_prediction.csv
    To compare with previous work mentioned in our paper, we use the joints 00,11,12,13,14,15,31,32,33,34,35 in our annotation, which correspond to [30,27,21,15,9,3,24,18,12,6,0] indexed by the original NYU dataset (this is an approximation, as we adopt different annotations).
  • Estimation joint locations for MSRC: MSRC_21jnt_hier_hybrid_sa_prediction.csv
  • Our conversion code between pixel coordinates (denoted as uvd) and camera coordinates (denoted as xyz): xyz_uvd.py
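The uvd/xyz conversion follows the standard pinhole camera model. The sketch below is not the released xyz_uvd.py; the intrinsics (FX, FY, CX, CY) are placeholder values and must be replaced with those of the sensor used by each dataset:

```python
import numpy as np

# Placeholder intrinsics (focal lengths and principal point, in pixels).
# Substitute the actual values for the dataset's depth sensor.
FX, FY, CX, CY = 241.42, 241.42, 160.0, 120.0

def uvd_to_xyz(uvd):
    """uvd: (N, 3) array of (u, v, depth). Returns (N, 3) camera coordinates."""
    uvd = np.asarray(uvd, dtype=np.float64)
    z = uvd[:, 2]
    # Back-project each pixel through the pinhole model.
    x = (uvd[:, 0] - CX) * z / FX
    y = (uvd[:, 1] - CY) * z / FY
    return np.stack([x, y, z], axis=1)

def xyz_to_uvd(xyz):
    """Inverse: project (N, 3) camera coordinates back to (u, v, depth)."""
    xyz = np.asarray(xyz, dtype=np.float64)
    u = xyz[:, 0] * FX / xyz[:, 2] + CX
    v = xyz[:, 1] * FY / xyz[:, 2] + CY
    return np.stack([u, v, xyz[:, 2]], axis=1)
```

A point at the principal point maps to x = y = 0 in camera coordinates, and the two functions are exact inverses of each other.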

Bibtex
@InProceedings{YeSpatialHandECCV2016,
  author    = {Ye, Qi and Yuan, Shanxin and Kim, Tae-Kyun},
  title     = {Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation},
  booktitle = {The European Conference on Computer Vision (ECCV)},
  year      = {2016}
}



[1] Tang, D., Taylor, J., Kohli, P., Keskin, C., Kim, T.K., Shotton, J.: Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In: ICCV (2015)
* indicates equal contribution

