The goal of this work is to perform hand and arm detection over continuous sequences that exceed one hour in length. Our work is motivated by the need, in automated British Sign Language recognition, to reliably find the position of the hands in every frame of a video.
We cast the problem as inference in a generative model of the image. Under this model, limb detection is expensive due to the very large number of possible configurations each part can assume. We make two contributions to reduce this cost: (i) using efficient sampling from a pictorial structure proposal distribution to obtain reasonable configurations; and (ii) identifying a large set of frames where correct configurations can be inferred, and then using temporal tracking elsewhere.
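To make contribution (i) concrete, the sketch below shows one standard way to draw samples from a tree-structured pictorial structure over discretised part poses: an upward sum-product pass followed by ancestral sampling. This is an illustrative toy under assumed inputs, not the paper's implementation; the function name, the discrete pose grid, and the three-part chain in the usage example are all invented for illustration, and the actual proposal distribution and part likelihoods are considerably richer.

```python
import numpy as np

def sample_pictorial_structure(unary, pairwise, parents, n_samples, rng=None):
    """Draw limb configurations from a tree-structured pictorial structure.

    Parts are assumed to be topologically ordered, root first, so that
    parents[p] < p for every non-root part p.

    unary[p]    : (S_p,) unnormalised image likelihood of part p per discrete pose
    pairwise[p] : (S_parent, S_p) spatial compatibility of part p with its parent
    parents[p]  : index of the parent part (the entry for the root is ignored)

    Returns an (n_samples, n_parts) array of sampled pose indices.
    """
    rng = rng or np.random.default_rng()
    n_parts = len(unary)

    # Upward pass: fold each part's subtree evidence into its parent's belief.
    belief = [np.asarray(u, dtype=float).copy() for u in unary]
    for p in range(n_parts - 1, 0, -1):
        belief[parents[p]] *= pairwise[p] @ belief[p]

    # Downward pass (ancestral sampling): sample the root from its marginal,
    # then each child pose conditioned on its sampled parent pose.
    samples = np.empty((n_samples, n_parts), dtype=int)
    root = belief[0] / belief[0].sum()
    samples[:, 0] = rng.choice(len(root), size=n_samples, p=root)
    for p in range(1, n_parts):
        for i in range(n_samples):
            cond = pairwise[p][samples[i, parents[p]]] * belief[p]
            cond /= cond.sum()
            samples[i, p] = rng.choice(len(cond), p=cond)
    return samples


# Toy usage (hypothetical): a 3-part chain, e.g. torso -> upper arm -> lower arm,
# with 4 discrete poses per part and random potentials.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unary = [rng.random(4) for _ in range(3)]
    pairwise = [None, rng.random((4, 4)), rng.random((4, 4))]
    print(sample_pictorial_structure(unary, pairwise, parents=[-1, 0, 1],
                                     n_samples=5, rng=rng))
```

Ancestral sampling gives exact, independent draws here only because the pictorial structure is a tree; the sampled configurations can then serve as candidates to be scored under a richer image model, in the spirit of the proposal-distribution role described above.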
Results are reported for signing footage with changing backgrounds, challenging imaging conditions, and different signers. We show that the proposed model identifies the true arm and hand locations, and demonstrate the robustness of both the sampling-based inference and the temporal limb detection. The composite method exceeds the state of the art in the length and stability of continuous limb tracking.
Example videos: signer1.avi, signer2.avi, signer3.avi (XviD codec)
Randomly chosen results for 3 different signers