Assisted Perception: Optimizing Observations to Communicate State

Paper (appeared at CoRL 2020) | Video | Code

Siddharth Reddy, Sergey Levine, Anca D. Dragan

University of California, Berkeley

We aim to help users estimate the state of the world in tasks like robotic teleoperation and navigation with visual impairments, where users may have systematic biases that lead to suboptimal behavior: they might struggle to process observations from multiple sensors simultaneously, receive delayed observations, or overestimate distances to obstacles. While we cannot directly change the user's internal beliefs or their internal state estimation process, our insight is that we can still assist them by modifying their observations. Instead of showing the user their true observations, we synthesize new observations that lead to more accurate internal state estimates when processed by the user.

We refer to this method as assistive state estimation (ASE). An automated assistant uses the true observations to infer the state of the world, then generates a modified observation for the user to consume (e.g., through an augmented reality interface), optimizing the modification so that the user's updated beliefs match the assistant's current beliefs. To predict the effect of a modified observation on the user's beliefs, ASE learns a model of the user's state estimation process: after each completed task, it searches for a model that would have produced beliefs that explain the user's actions.
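As a rough illustration of the observation-selection step, the sketch below works in a small discrete setting. It is not the paper's implementation: the tabular beliefs, the hypothetical function names (`user_update`, `choose_observation`), the use of a Bayesian update as the user model, and the choice of KL divergence as the matching objective are all assumptions made for the example.

```python
import numpy as np

def user_update(prior, user_obs_model, obs):
    """Assumed user model: a Bayesian update with a (possibly biased) likelihood.

    user_obs_model[s, o] is the probability the user assigns to seeing
    observation o in state s. Entries are assumed strictly positive.
    """
    posterior = prior * user_obs_model[:, obs]
    return posterior / posterior.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped for numerical safety."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def choose_observation(assistant_belief, user_prior, user_obs_model):
    """Pick the observation whose induced user belief is closest to the assistant's."""
    n_obs = user_obs_model.shape[1]
    divergences = [
        kl(assistant_belief, user_update(user_prior, user_obs_model, o))
        for o in range(n_obs)
    ]
    return int(np.argmin(divergences))

# Hypothetical example: three states, three observations. The user's model is
# "shifted": they read observation 2 as evidence for state 1, and so on.
user_obs_model = np.array([
    [0.1, 0.8, 0.1],   # state 0
    [0.1, 0.1, 0.8],   # state 1
    [0.8, 0.1, 0.1],   # state 2
])
user_prior = np.ones(3) / 3
assistant_belief = np.array([0.05, 0.90, 0.05])  # assistant is fairly sure of state 1

best_obs = choose_observation(assistant_belief, user_prior, user_obs_model)
```

In this toy setup the assistant shows observation 2 rather than the "true" observation for state 1, because that is the observation the biased user interprets as evidence for state 1. A continuous or image-based setting would replace the exhaustive search over observations with an optimization procedure.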

We evaluate ASE in a user study with 12 participants who each perform four tasks: two with known biases, and two with unknown biases that our method has to learn. ASE's general-purpose approach to synthesizing informative observations enables a different assistance strategy to emerge in each domain.

In a bandwidth-limited MNIST image classification task, ASE helps the user identify the digit using fewer pixels.

In a guided 2D navigation task based on Habitat environments, ASE identifies nearby landmarks, which helps a bandwidth-limited user infer their position and orientation.

In the Car Racing video game with observation delay, ASE uses a dynamics model to fill in missing frames, which helps the user make time-sensitive steering decisions.

In the Lunar Lander teleoperation video game, ASE learns to exaggerate a visual indicator of tilt, which helps the user detect small tilts early and correct them.
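To make the Lunar Lander strategy concrete, here is a minimal sketch of tilt exaggeration under a deliberately simple assumption: the user perceives a displayed tilt scaled by a constant factor `user_gain` below 1. The linear bias model, the function name `displayed_tilt`, and the clipping range are hypothetical choices for the example; the bias that ASE learns in the study need not take this form.

```python
import math

def displayed_tilt(true_tilt, user_gain, max_tilt=math.pi / 2):
    """Return the tilt to display so that a user who perceives
    `user_gain * displayed` ends up perceiving roughly `true_tilt`.

    user_gain: assumed perception scale in (0, 1]; values below 1 mean
               the user underestimates tilt.
    max_tilt:  assumed limit of what the indicator can show, in radians.
    """
    if not 0.0 < user_gain <= 1.0:
        raise ValueError("user_gain must be in (0, 1]")
    shown = true_tilt / user_gain
    # Clip to the range the indicator can actually display.
    return max(-max_tilt, min(max_tilt, shown))

# A user who perceives only half of the displayed tilt is shown twice the true tilt.
small = displayed_tilt(0.1, user_gain=0.5)
# Large tilts saturate at the indicator's limit.
large = displayed_tilt(1.5, user_gain=0.5)
```

Under this assumed model, small tilts are amplified the most in relative terms before clipping takes over, which matches the qualitative behavior described above: small tilts become easier to notice early.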

Supplementary Videos


Lunar Lander without ASE: the user tends to underestimate tilt, so they do not react quickly enough to prevent extreme tilt


Lunar Lander with ASE: the tilt indicator is exaggerated, which helps the user react more quickly and prevent extreme tilt


Car Racing without ASE: during delay phases, when the real observation stops updating, the user has trouble steering because it is hard to visualize where the car currently is


Car Racing with ASE: during delay phases when the real observation stops updating, ASE fills the gap with plausible synthetic images so that the user can do real-time, closed-loop control without dealing with the intermittent delay

Demo of the MNIST digit labeling interface from the user study


In the unassisted condition, rows are revealed in order from top to bottom, which is slow to reveal informative pixels. The random baseline tends to spread the revealed rows uniformly throughout the image, which is a good strategy in the long run but does not necessarily reveal informative pixels early in the episode. ASE tends to quickly reveal rows near the middle and rows with many non-zero pixels, enabling the user to guess the label accurately earlier in the episode.


Demo of the 2D navigation experiments with simulated users in indoor Habitat environments

Demo of the 2D navigation interface from the user study