AR/VR
Lifting for 3D Hand Pose Estimation
Mobile AR/VR devices rely on world views generated through low-cost 2D cameras. For example, a single 2D RGB camera is used to track the 3D location of a mobile phone by AR SDKs such as ARCore and ARKit. This project adds tracking of the human hand to such SDKs, primarily targeting lightweight, low-complexity environments such as AR glasses and mobile VR devices.
Keeping track of the user's hands enables rich interactions. For example, menu- or gesture-based interactions with the hand allow the user to select options, input data, and point to interesting objects. Tactile-feedback use cases, where menus are rendered in the palm of one hand while selections are made with the other, also become possible. The hand can also serve as a secondary controller, extending user interaction in cases where the primary controller is inadequate or awkward.
A 2D picture is flat and offers few clues about the 3D pose and location of depicted objects. But when an object is highly structured, those few clues can be used to determine its 3D pose and location. We show through a skeletal model that the human hand is such an object, whose 3D particulars can be solved for using a computationally simple 2D-to-3D lifting algorithm.
Real-Time Lifting Prototype.
The ambiguities involved in going from 2D to 3D are surprisingly contained for the hand, primarily because of the many skeletal constraints that our technique takes advantage of. Furthermore, the ambiguous cases can easily be detected (and remedied with extra processing when desired).
One of the nice benefits of the technique is that it doesn't need any 3D training data whatsoever. Provide it with 2D hand keypoints and, say, the length of the index finger, and it will generate 3D pose and location.
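To illustrate the kind of per-bone computation involved (a minimal sketch, not the published algorithm; the intrinsics matrix K, the joint choices, and the numbers are hypothetical), a child joint constrained to lie on its camera ray and at a known bone length from an already-lifted parent reduces to a quadratic in the child's depth, whose two roots are exactly the bend-toward/bend-away ambiguity mentioned above:

```python
import numpy as np

def lift_child_keypoint(K, uv_child, P_parent, bone_length):
    """Lift one 2D keypoint to 3D given its already-lifted parent joint
    and the (known) bone length connecting them.

    The child must lie on the camera ray through its 2D detection and at
    distance bone_length from the parent: a quadratic in the child's depth.
    Two real roots correspond to the bend-toward / bend-away ambiguity.
    """
    # Ray direction through the 2D detection (depth-1 parameterization).
    r = np.linalg.solve(K, np.array([uv_child[0], uv_child[1], 1.0]))

    # || z * r - P_parent ||^2 = bone_length^2  ->  a z^2 + b z + c = 0
    a = r @ r
    b = -2.0 * (r @ P_parent)
    c = P_parent @ P_parent - bone_length ** 2

    disc = b * b - 4.0 * a * c
    if disc < 0.0:            # detection/length inconsistent: flag instead of guessing
        return []
    roots = [(-b + s * np.sqrt(disc)) / (2.0 * a) for s in (+1.0, -1.0)]
    return [z * r for z in roots if z > 0.0]   # keep solutions in front of the camera

# Toy usage: wrist already lifted at 40 cm depth, a finger joint observed at pixel (330, 250),
# bone length 5 cm. Intrinsics are made up.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
candidates = lift_child_keypoint(K, (330.0, 250.0), np.array([0.0, 0.0, 0.40]), 0.05)
print(candidates)  # up to two 3D candidates; skeletal constraints pick one
```

Chaining such steps outward from the wrist, with the skeletal constraints selecting among the candidate roots, gives the flavor of how a full 3D pose and location can be recovered from 2D keypoints plus bone lengths alone.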
The technique is lightweight and comfortably super-real-time (an unoptimized implementation, with no multithreading or vectorization, runs at 300+ fps on a single core for a single depicted hand).
Strange but true: many 3D keypoint detectors in computer vision report results after aligning with ground-truth data through Procrustes analysis. This is OK for validating pose but hides the bias errors that techniques have (how far is that fingertip from the camera?). Accurate 3D location is very important when one is testing menu/touch interaction in 3D. The lifting work here generates high-quality results even when completely unaligned, with more than 80% of keypoints within 2 cm of ground truth on a public dataset; aligned results go over 85%. These are as good as, if not better than, the state of the art (circa 2017). Not bad for a technique that doesn't need any 3D data to train on or learn from.
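To make the unaligned-versus-aligned distinction concrete, here is a generic evaluation sketch (not the exact protocol of the paper): percentage of correct keypoints at the 2 cm threshold mentioned above, computed before and after a rigid Procrustes (Kabsch) alignment that hides translation and rotation bias:

```python
import numpy as np

def pck(pred, gt, thresh=0.02):
    """Fraction of keypoints within thresh meters of ground truth."""
    return np.mean(np.linalg.norm(pred - gt, axis=1) < thresh)

def procrustes_align(pred, gt):
    """Rigidly align predictions to ground truth (Kabsch): removes any
    translation/rotation bias before the error is measured."""
    pc, gc = pred - pred.mean(0), gt - gt.mean(0)
    U, _, Vt = np.linalg.svd(pc.T @ gc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return pc @ R.T + gt.mean(0)

# Toy check: 21 ground-truth keypoints (meters) and a prediction with a 3 cm depth bias.
gt = np.random.default_rng(0).uniform(-0.1, 0.1, (21, 3)) + [0.0, 0.0, 0.5]
pred = gt + [0.0, 0.0, 0.03]
print(pck(pred, gt), pck(procrustes_align(pred, gt), gt))  # 0.0 unaligned, 1.0 aligned
```

The toy numbers show the point: a prediction that is uniformly 3 cm too far from the camera scores perfectly after alignment yet would miss every virtual menu button it is supposed to touch.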
Beyond device deployment, if one had a good 2D hand keypoint dataset that one wanted to convert to 3D, this technique would be a good candidate. Since it is accurate and can flag ambiguous cases, one can generate a high-confidence 3D dataset.
Depth Denoising and Edge Defattening
Active and passive stereo matchers utilize difference metrics calculated over windows, which results in incorrect, broadened depth estimates around foreground objects. As a result, depth edges, especially around close-by objects, tend to be inaccurate (also referred to as edge broadening or fattening).
In AR scenarios where virtual assets are placed in scenes, such broadening in turn results in inaccurate rendered occlusions.
In the portrait mode of smartphone cameras, broadening leads to incorrect blurring.
This work constructs a fast algorithm that identifies incorrect depth broadening and provides significantly better estimates.
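The mechanism behind the fattening can be seen in a toy 1D block matcher (a sketch of the problem only, not the correction algorithm of this work): windows that straddle a high-contrast foreground edge keep preferring the foreground disparity, so the foreground comes out several pixels wider than it really is:

```python
import numpy as np

# Synthetic 1D stereo pair: bright foreground strip (disparity 8) over a dark
# textured background (disparity 2).
rng = np.random.default_rng(0)
n, d_fg, d_bg = 200, 8, 2
tex = rng.normal(0.0, 0.1, n + 20)                 # shared background texture
left = tex[:n].copy()
left[80:120] += 1.0                                # foreground strip in the left image
right = tex[d_bg:n + d_bg].copy()                  # background shifted by its disparity
right[80 - d_fg:120 - d_fg] = left[80:120]         # foreground shifted by its disparity

def block_match(left, right, max_d=12, win=7):
    """Winner-take-all SSD over a window. Windows straddling the foreground
    edge keep matching at the foreground disparity, so the estimated
    foreground comes out roughly win//2 pixels wider on each side."""
    h = win // 2
    disp = np.zeros(len(left), dtype=int)
    for x in range(max_d + h, len(left) - h):
        costs = [np.sum((left[x - h:x + h + 1] - right[x - d - h:x - d + h + 1]) ** 2)
                 for d in range(max_d)]
        disp[x] = int(np.argmin(costs))
    return disp

disp = block_match(left, right)
print("pixels estimated at foreground disparity:", np.sum(disp == d_fg), "(true width: 40)")
```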
Candy-Cane Networks For Easy Eye-Tracking
This model configures a UNet as a segmentation-regularized eye tracker.
The left side of the U drives the gaze model and the right side drives segmentation (pupil, glints, etc.); the two are jointly trained.
Gaze data tends to be sparse whereas segmentation data is extensive. One hence uses the latter to regularize the former.
At inference time only the gaze model is used. During training and debugging (say the gaze is wrong; is the segmentation also wrong?) the segmentation side is connected.
This is a simple decomposition that allows one to use a model well known to be great at segmentation problems for eye tracking (with light modifications).
Since one is not reinventing the wheel and is rather using familiar components, no one doubts the overall architecture. I dare say, and you would likely agree, that if it is not working you are probably not training it right :)
One can place the dense network at various points of the UNet multi-resolution ladder depending on the imaging scenario.
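A minimal sketch of such a candy-cane configuration (layer sizes, loss weights, and the bottleneck placement of the gaze head are illustrative assumptions, not the deployed model): a small UNet in PyTorch whose encoder feeds a dense gaze head while the decoder produces segmentation, trained jointly and run gaze-only at inference:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

class CandyCaneNet(nn.Module):
    """UNet whose encoder (left side) feeds a dense gaze head and whose
    decoder (right side) produces segmentation (pupil, glints, ...).
    Trained jointly; at inference only the encoder + gaze head are run."""
    def __init__(self, n_seg_classes=3):
        super().__init__()
        self.enc1, self.enc2 = block(1, 16), block(16, 32)
        self.bott = block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up2, self.dec2 = nn.ConvTranspose2d(64, 32, 2, stride=2), block(64, 32)
        self.up1, self.dec1 = nn.ConvTranspose2d(32, 16, 2, stride=2), block(32, 16)
        self.seg_head = nn.Conv2d(16, n_seg_classes, 1)
        # Dense gaze head hanging off one rung of the multi-resolution ladder.
        self.gaze_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x, with_segmentation=True):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bott(self.pool(e2))
        gaze = self.gaze_head(b)                      # (yaw, pitch)
        if not with_segmentation:                     # inference path: gaze only
            return gaze, None
        d2 = self.dec2(torch.cat([self.up2(b), e2], 1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], 1))
        return gaze, self.seg_head(d1)

# Joint training step (hypothetical loss weights); segmentation regularizes the shared encoder.
model = CandyCaneNet()
img = torch.randn(4, 1, 64, 64)
gaze, seg = model(img)
loss = nn.functional.mse_loss(gaze, torch.zeros(4, 2)) + \
       0.5 * nn.functional.cross_entropy(seg, torch.zeros(4, 64, 64, dtype=torch.long))
loss.backward()
```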
Augmented Reality Enabled Telepresence and Interactions
Augmented Reality Enabled Telepresence Prototype. Real-time AR-enabled telepresence prototype using a commodity 4-core CPU (circa 2011) + Kinect + HD webcam, shown below connected to the Futurewei Technologies telepresence room in Santa Clara, CA.
The user at the actual locale is imaged with a depth camera and an HD webcam. HD video and segmentation information are transmitted at 30 fps. At the remote end the user is composited to appear in a telepresence room.
Telepresence meetings connect two controlled rooms with high-quality displays and communications gear. Oftentimes, however, there are auxiliary nodes that connect to the meeting from uncontrolled environments over lower-quality communications (the so-called "CEO calling in from a remote office visit" scenario). This causes a significant loss of immersion in the meeting.
This project brings together a suite of portable technologies that in effect allow the formation of a telepresence-cubicle practically anywhere. The caller from the telepresence-cubicle is presented as if they are in the same telepresence room at one of the main meeting sites.
The aim of the technology is not to give false impressions but to increase immersion. In addition to significantly increased immersion, the technology enables bandwidth savings compared to regular video conferencing, allowing higher-quality visuals even over low-bandwidth links.
Summary of technology components. Background removal using RGB + depth segmentation, boundary conditioning and artifact removal, temporal coherence enforcement, simple gesture-based interactions, and real-time video communications (including shape coding) using VP8/RTP. Each full-duplex communication node is powered by a commodity 4-core CPU, Kinect, and HD webcam.
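For flavor, a minimal sketch of the background-removal component only (the thresholds, hole-filling rule, and smoothing factor are illustrative assumptions, not the deployed pipeline): depth-gated foreground extraction with RGB-guided hole filling and temporal smoothing:

```python
import numpy as np

def segment_foreground(depth, rgb, prev_mask=None, z_max=1.5, alpha=0.7):
    """Minimal RGB + depth foreground segmentation with temporal smoothing.

    depth: HxW array in meters (0 where the sensor returned nothing),
    rgb: HxWx3 float array in [0, 1], prev_mask: previous frame's soft mask.
    """
    # Depth gate: keep pixels closer than z_max (the seated user).
    mask = ((depth > 0) & (depth < z_max)).astype(np.float32)

    # Fill depth holes by borrowing the decision of a similar-colored left
    # neighbor (a crude stand-in for the boundary conditioning above).
    holes = depth == 0
    neighbor = np.roll(mask, 1, axis=1)
    similar = np.linalg.norm(rgb - np.roll(rgb, 1, axis=1), axis=2) < 0.1
    mask[holes & similar] = neighbor[holes & similar]

    # Temporal coherence: exponential smoothing against the previous mask.
    if prev_mask is not None:
        mask = alpha * mask + (1.0 - alpha) * prev_mask
    return mask

# Per frame: mask = segment_foreground(depth, rgb, mask);
# composite = mask[..., None] * rgb + (1.0 - mask[..., None]) * room_background
```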
Beyond its main functionality, the technology allows simple adjustments to meeting parameters, pointing gestures at slides, etc., using a natural user interface.
Visual Conditioning
When recomposing real-world foregrounds one often needs to deal with boundary artifacts due to depth-sensor noise or other segmentation noise. One hence has to devise boundary artifact removal that accomplishes boundary smoothing and filling in of missing values in a visually pleasing manner. With real-time HD targets one is of course limited in the amount of computational resources available.
This technique uses spatially varying textures to blend foregrounds over adaptively designed and colored backgrounds. The goal is to hide artifacts (masked with textures) and to form a visually pleasing color profile around the foreground. The background design uses (i) cross-hatching (contour hatching and cross-hatching are drawing techniques used in art), which helps with masking, and (ii) harmonic colors.
A noisily-segmented foreground with light errors, recomposed after visual conditioning.
A noisily-segmented foreground with coarse errors, recomposed after visual conditioning.
Blending textures adapt to the foreground: around smooth portions they are of lower frequency and intensity whereas, for example, around the hair they become directional, with frequencies matching local statistics.
In addition to their spatial masking properties, boundary smoothing and texture generation are designed not to generate spurious temporal noise.
Fixed backgrounds can be recolored to be harmonic with the target and recomposed.
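A minimal sketch of such harmonic recoloring (an illustration under simple assumptions, not the published method): rotate the background's hues so its dominant hue becomes complementary to the foreground's dominant hue:

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def harmonize_background(background_rgb, foreground_rgb, fg_mask):
    """Rotate the background's hues so its dominant hue lands on the
    complement of the foreground's dominant hue; saturation and value
    are left alone. Images are float arrays in [0, 1]; fg_mask is HxW boolean."""
    fg_hue = np.median(rgb_to_hsv(foreground_rgb)[..., 0][fg_mask])  # dominant fg hue

    bg_hsv = rgb_to_hsv(background_rgb)
    bg_hue = np.median(bg_hsv[..., 0])
    shift = ((fg_hue + 0.5) - bg_hue) % 1.0        # complementary-hue target
    bg_hsv[..., 0] = (bg_hsv[..., 0] + shift) % 1.0
    return hsv_to_rgb(bg_hsv)

# composite = np.where(fg_mask[..., None], foreground_rgb,
#                      harmonize_background(background_rgb, foreground_rgb, fg_mask))
```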
Four example background / recomposed pairs.
Patents:
O. G. Guleryuz and S. R. F. Fanello, "Object Observation Tracking In Images Using Encoder-Decoder Models," filed June 2021. Assigned to Google LLC.
O. G. Guleryuz, "Hand Skeleton Learning, Lifting, And Denoising from 2D Images," filed, December 2017. Assigned to Google LLC.
J. Ehmann, L. Zhou, O. G. Guleryuz, F. Lv, F. Zhu, and N. Dhar, “A System and Method for Augmented Reality-Enabled Interactions and Collaboration,” issued, February 2016. Assigned to Futurewei Technologies, Inc. Patent no: 9,270,943.
O. G. Guleryuz and A. Kalker, “Visual Conditioning for Augmented-Reality-Assisted Video Conferencing,” issued, October 2015. Assigned to Futurewei Technologies, Inc. Patent no: 9,154,732.
Disclosure:
O. G. Guleryuz and A. Csaszar, "Identifying and Correcting an Edge-Fattened Area Generated by Stereo-Matching Techniques," Technical Disclosure Commons, May 2019.
Papers:
O. G. Guleryuz and C. Kaeser-Chen, "Fast Lifting for 3D Hand Pose Estimation in AR/VR Applications," Proc. IEEE Int'l Conf. on Image Proc. (ICIP2018), Athens, Greece, Oct. 2018, {pdf}.
J. Ehmann and O. G. Guleryuz, “Temporally coherent 4D video segmentation for teleconferencing,” Proc. SPIE Conf. on Applications of Digital Image Processing XXXVI, San Diego, Aug. 2013, {pdf}.
O. G. Guleryuz and T. Kalker, “Visual Conditioning for Augmented-Reality-Assisted Video Conferencing,” Proc. IEEE Int’l Workshop on Multimedia (MMSP2012), Banff, Canada, Sept. 2012, {pdf}.