To detect hands and the objects they interact with from RGB photos alone, we train a ResNet-based neural network to draw bounding boxes around each and classify them. First, we collect and label thousands of photos of hands and combine them with an existing annotated dataset, "100 Days of Hands" [1]. The combined dataset is used to train four models with different epoch counts and learning rates. The resulting models are tested and evaluated, and their accuracies are compared in terms of precision, recall, and mean Average Precision (mAP).
Following hand detection, we reconstruct a 3D mesh for each hand by estimating the pose and shape parameters of the MANO model [2], fully integrating the hand-detection model with an end-to-end 3D neural network. Ultimately, from a raw photo, we can accurately place bounding boxes around hands and objects, identify hand side and contact state, and reconstruct the 3D hand pose.
With multiple camera angles, hands, objects, and interactions, we captured over 8,000 photos and annotated [3] over 450 of them to add to an existing dataset of in-the-wild images [1]. Each image includes at least one hand, whose contact state is denoted by an abbreviation:
N: no contact
S: self contact
P: portable object contact
O: other person contact
Hand sides and objects are labeled similarly:
L: left hand
R: right hand
O: object
For example, R-N indicates a right hand with no contact, O denotes an object, and L-O indicates a left hand in contact with another person.
Examples are shown below:
Although our own annotated dataset and the existing dataset [1] are both stored in .xml format, the attributes of the hand and object classes differ. Our dataset encodes the hand side, hand state, and contact state in the label name itself, whereas the existing dataset stores this information as attributes of each annotated object. To combine the two datasets for training, we wrote a script that parses the information from our labels and converts our dataset to the existing format.
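A minimal sketch of this conversion is shown below in Python. The combined-label format (e.g. "R-N") follows our scheme above, but the converted class and attribute names (hand, targetobject, handside, contactstate) are illustrative assumptions and may differ from the exact schema of [1].

    # Sketch of the label-conversion script; attribute names are assumptions.
    import xml.etree.ElementTree as ET

    SIDE = {"L": "left", "R": "right"}
    CONTACT = {"N": "no_contact", "S": "self_contact",
               "P": "portable_object", "O": "other_person"}

    def convert(in_path, out_path):
        tree = ET.parse(in_path)
        for obj in tree.getroot().iter("object"):
            label = obj.find("name").text        # e.g. "R-N", "L-P", or "O"
            if label == "O":                     # plain object, no hand info
                obj.find("name").text = "targetobject"
                continue
            side, contact = label.split("-")     # split combined label
            obj.find("name").text = "hand"
            ET.SubElement(obj, "handside").text = SIDE[side]
            ET.SubElement(obj, "contactstate").text = CONTACT[contact]
        tree.write(out_path)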
The dataset, originally stored as a folder of .xml files, is split randomly into training, validation, and testing sets in a 50:20:30 ratio, and the names of our files are appended to the original list files (train.txt, val.txt, test.txt). We trained four models with different learning rates and epoch counts: model 1 (learning rate 1e-3, 10 epochs), model 2 (1e-4, 8 epochs), model 3 (1e-4, 10 epochs), and model 4 (1e-5, 10 epochs).
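The split itself is straightforward; below is a minimal sketch in Python, assuming one .xml annotation file per image and the list-file names above (the annotations folder name is an assumption).

    # Randomly split annotation files 50:20:30 and append the image
    # names to the existing train/val/test list files.
    import os, random

    random.seed(0)  # fixed seed so the split is reproducible
    names = [f[:-4] for f in os.listdir("annotations") if f.endswith(".xml")]
    random.shuffle(names)

    n = len(names)
    splits = {"train.txt": names[:int(0.5 * n)],
              "val.txt":   names[int(0.5 * n):int(0.7 * n)],
              "test.txt":  names[int(0.7 * n):]}

    for list_file, subset in splits.items():
        with open(list_file, "a") as f:      # append to the original lists
            f.write("\n".join(subset) + "\n")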
Each model takes approximately two days to train. The detection model is a Faster R-CNN with a ResNet-101 backbone, extended with four additional fully connected layers that output, alongside the hand and object detections, two auxiliary state predictions (hand side and contact state) and one offset vector relating each hand to the object it touches.
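The sketch below shows one plausible arrangement of the four added layers as PyTorch heads on top of pooled ROI features; the feature dimension, hidden width, and exact wiring are assumptions, not the reference architecture.

    # Hypothetical auxiliary heads: one shared FC layer plus three
    # output layers (four fully connected layers in total).
    import torch.nn as nn

    class HandStateHeads(nn.Module):
        def __init__(self, in_dim=2048, hidden=256):
            super().__init__()
            self.shared  = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.side    = nn.Linear(hidden, 2)  # left / right
            self.contact = nn.Linear(hidden, 4)  # N, S, P, O
            self.offset  = nn.Linear(hidden, 2)  # 2-D hand-to-object offset

        def forward(self, roi_feats):            # roi_feats: [num_rois, in_dim]
            h = self.shared(roi_feats)
            return self.side(h), self.contact(h), self.offset(h)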
The testing data is fed into the model to evaluate its performance. Accuracy is measured by Average Precision (AP) [4], a value between 0 and 1 derived from two quantities: precision and recall. Precision is the number of true positives divided by the number of predicted positives; recall is the number of true positives divided by the number of actual positives. Plotting precision against recall as the detection threshold varies yields a precision-recall curve, and AP is the area under that curve: the closer the AP is to 1, the more precise the model. Below is a table of the average precisions for the attributes of the hand and object classes, including hand state, contact state, etc. As shown, model 1 (learning rate 1e-3, 10 epochs) consistently outperforms the other models and is comparable to the reference [5]; it is therefore chosen as the final model.
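In symbols, with TP, FP, and FN counting true positives, false positives, and false negatives, and p(r) denoting precision as a function of recall:

    Precision = TP / (TP + FP)
    Recall    = TP / (TP + FN)
    AP        = area under the precision-recall curve p(r), for r in [0, 1]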
We fully integrated our hand-detection model with an existing model [6] that generates 3D hand-pose estimates, enabling pose estimation on images with multiple or off-center hands.
Several existing papers address 3D hand-pose reconstruction from a 2D image. Among them is a model [6] that, given an image of a hand in contact with an object, jointly estimates the 3D hand and object pose. However, this model has several shortcomings. Since it does not perform hand detection, it expects the hand to be centered, which lowers the accuracy of the pose estimate when the hand is off-center in the image. It can process only one hand per image, so when multiple hands are present it simply picks one at random rather than processing them all. In addition, the model is trained mostly on left hands; given a right hand, it flips the image and processes it as a left hand, and because it cannot recognize the hand side on its own, the user must specify it before execution.
To address these issues, we combine this model with our hand-detection model described above. Given a raw video clip, the hand-detection model produces bounding boxes for the hands, each of which defines a cropped subimage containing a single, centered hand. Each cropped subimage is then fed into the pose-estimation model. After the integration, we extend our model's capability to 3D reconstruction and greatly improve the accuracy of the existing model, especially in cases where hands are off-center. The hand-detection model also supplies the hand side, automating a step that previously required user input.
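A minimal sketch of the integrated pipeline in Python; detect_hands() and estimate_pose(), along with the box attributes, are hypothetical stand-ins for our detector and the pose network [6].

    # For each detected hand: crop it, mirror right hands (the pose
    # network expects left hands), and estimate MANO pose and shape.
    import cv2

    def reconstruct_hands(frame):
        results = []
        for box in detect_hands(frame):          # hypothetical detector call
            x1, y1, x2, y2 = box.xyxy            # bounding-box corners
            crop = frame[y1:y2, x1:x2]           # centered single-hand crop
            if box.side == "right":              # mirror so the crop looks
                crop = cv2.flip(crop, 1)         # like a left hand
            results.append(estimate_pose(crop))  # hypothetical pose call [6]
        return results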
[1] Shan, Dandan. “100 Days of Hands.” 100DOH, fouheylab.eecs.umich.edu/~dandans/projects/100DOH/download.html.
[2] Romero, Javier, et al. "Embodied Hands: Modeling and Capturing Hands and Bodies Together." ACM Transactions on Graphics (Proc. SIGGRAPH Asia), vol. 36, no. 6, Nov. 2017, pp. 245:1–245:17, doi:10.1145/3130800.3130883.
[3] Lin, Darren. "LabelImg." GitHub, 27 June 2021, github.com/tzutalin/labelImg.
[4] Hui, Jonathan. “MAP (Mean Average Precision) for Object Detection.” Medium, 3 Apr. 2019, jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173.
[5] Shan, Dandan, et al. "Understanding Human Hands in Contact at Internet Scale." arXiv, 11 June 2020, arxiv.org/pdf/2006.06669.pdf.
[6] Hasson, Yana, et al. "Learning Joint Reconstruction of Hands and Manipulated Objects." arXiv, 11 Apr. 2019, arxiv.org/pdf/1904.05767.pdf.