Seeing is not Believing:

Robust Reinforcement Learning against Spurious Correlation

Published at NeurIPS 2023

Wenhao Ding*, Laixi Shi*, Yuejie Chi, Ding Zhao

Carnegie Mellon University

[arXiv / Code (coming soon)]

Motivation

Consider a driving scenario in which a shift between the training and test environments, caused by an unobserved confounder, can lead to a severe safety issue. The observed brightness and traffic density have no causal effect on each other; both are controlled by a confounder (i.e., the sun and human activity) that is usually unobserved by the agent. During training, the agent can memorize the spurious correlation between brightness and traffic density, i.e., traffic is heavy during the daytime but light at night. Such a correlation becomes problematic at test time when the value of the confounder deviates from the training one, e.g., traffic becomes heavy at night due to special events and changes in human activity, as shown at the bottom of the figure. Consequently, a policy dominated by the spurious correlation learned during training fails on out-of-distribution samples (heavy traffic at night) in the test scenarios.

Problem formulation of State-confounded MDP

We compare the formulation of our State-confounded MDP (SC-MDP) with other related formulations. The main difference is that SC-MDP considers spurious correlation within the state itself, which opens a backdoor path between the action and the state dimensions.
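As a rough sketch (the notation here is ours for illustration and may differ from the exact definition in the paper): write the observed state as $s_t = (s_t^1, s_t^2)$, where the two components are generated by an unobserved confounder $c_t \sim p_{\mathrm{train}}$ rather than by each other,

$$ s_t^1 \sim P^1(\cdot \mid c_t), \qquad s_t^2 \sim P^2(\cdot \mid c_t), $$

so the transition kernel the agent actually experiences marginalizes over the confounder,

$$ \Pr(s_{t+1} \mid s_t, a_t) \;=\; \sum_{c} p_{\mathrm{train}}(c \mid s_t)\, P(s_{t+1} \mid s_t, a_t, c). $$

A policy trained under $p_{\mathrm{train}}$ can therefore exploit the induced correlation between $s^1$ and $s^2$, and it breaks when the test-time confounder distribution $p_{\mathrm{test}}$ deviates from $p_{\mathrm{train}}$.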

Model architecture for causal discovery 

We design an encoder-decoder structure with a learnable binary matrix as the bottleneck. This binary matrix serves as a causal graph that controls which features are combined to predict the next state and reward given the current state and action. To keep the learning differentiable, we use Gumbel-Softmax to sample the entries of the causal graph.
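Below is a minimal PyTorch sketch of this idea, not the released implementation: the module name, dimensions, and the choice of one decoder head per predicted dimension are assumptions, and the per-dimension encoders are omitted for brevity (raw state/action dimensions are used directly as features).

# Minimal sketch (not the authors' code) of a dynamics model whose bottleneck
# is a learnable binary causal mask sampled with Gumbel-Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalMaskDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128, tau=1.0):
        super().__init__()
        in_dim = state_dim + action_dim
        out_dim = state_dim + 1  # next-state dimensions + reward
        # One pair of logits per (input, output) edge; channel 1 means "edge present".
        self.edge_logits = nn.Parameter(torch.zeros(in_dim, out_dim, 2))
        self.tau = tau
        # One small decoder per predicted dimension; each sees only masked inputs.
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, 1))
            for _ in range(out_dim)
        )

    def sample_mask(self, hard=True):
        # Gumbel-Softmax keeps the binary sampling differentiable
        # (straight-through gradients when hard=True).
        probs = F.gumbel_softmax(self.edge_logits, tau=self.tau, hard=hard)
        return probs[..., 1]  # (in_dim, out_dim) with entries in {0, 1}

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)   # (batch, in_dim)
        mask = self.sample_mask()                # (in_dim, out_dim)
        # Each target dimension only sees the inputs its causal-graph column allows.
        preds = [dec(x * mask[:, j]) for j, dec in enumerate(self.decoders)]
        pred = torch.cat(preds, dim=-1)          # (batch, out_dim)
        next_state, reward = pred[..., :-1], pred[..., -1:]
        return next_state, reward, mask

In practice one would also regularize the mask toward sparsity so that only the edges needed for prediction survive; see the paper for the actual training objective.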

Experiment results in Robosuite

We train both SAC and RSC-SAC (Ours) in the nominal (in-distribution) environments and test them in both the nominal and the shifted (out-of-distribution) environments. SAC fails when tested in the shifted environments, while our RSC-SAC still succeeds.

Lift

Nominal environment 1 (for training and testing): the red block is always on the right side.
Nominal environment 2 (for training and testing): the green block is always on the left side.
Shifted environment 1 (for testing): the red block is always on the left side.
Shifted environment 2 (for testing): the green block is always on the right side.

[Videos: SAC (tested on nominal 1) | SAC (tested on shifted 1) | Ours (tested on nominal 1) | Ours (tested on shifted 1)]

Stack

Nominal environment 1 (for training and testing): both the block and the target position are on the left side of the table.
Nominal environment 2 (for training and testing): both the block and the target position are on the right side of the table.
Shifted environment 1 (for testing): the block is on the left side and the target position is on the right side.
Shifted environment 2 (for testing): the block is on the right side and the target position is on the left side.

[Videos: SAC (tested on nominal 2) | SAC (tested on shifted 1) | Ours (tested on nominal 1) | Ours (tested on shifted 1)]

Wipe

Nominal environment 1 (for training and testing): the red block is on the right side and the dirty region runs from top-left to bottom-right.
Nominal environment 2 (for training and testing): the red block is on the left side and the dirty region runs from bottom-left to top-right.
Shifted environment 1 (for testing): the red block is on the left side and the dirty region runs from top-left to bottom-right.
Shifted environment 2 (for testing): the red block is on the right side and the dirty region runs from bottom-left to top-right.

[Videos: SAC (tested on nominal 2) | SAC (tested on shifted 1) | Ours (tested on nominal 2) | Ours (tested on shifted 1)]

Door

Nominal environment 1 (for training and testing): the handle is in the low position on the door and the door is close to the robot.
Nominal environment 2 (for training and testing): the handle is in the high position on the door and the door is far from the robot.
Shifted environment 1 (for testing): the handle is in the low position on the door and the door is far from the robot.
Shifted environment 2 (for testing): the handle is in the high position on the door and the door is close to the robot.

[Videos: SAC (tested on nominal 2) | SAC (tested on shifted 1) | Ours (tested on nominal 2) | Ours (tested on shifted 1)]