Abstract Form of Offline RL Objective Without Data Sharing
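Schematically, and with generic placeholders rather than the paper's exact notation, conservative offline RL objectives of this family (e.g., CQL) take a form along the lines of
\[
\pi^*_i \;:=\; \arg\max_{\pi}\; J_{\mathcal{D}_i}(\pi) \;-\; \alpha\, D\!\left(\pi,\, \pi_{\beta,i}\right),
\]
where J_{\mathcal{D}_i}(\pi) denotes the return of \pi in the MDP induced by the dataset \mathcal{D}_i of task i, \pi_{\beta,i} is the behavior policy that generated \mathcal{D}_i, D is a divergence between policies, and \alpha \geq 0 trades off return maximization against conservatism.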
Prior works (e.g., CQL and CDS) have shown that the policy that optimizes the above equation attains a high-probability safe-policy improvement guarantee, as follows:
Safe Policy Improvement Guarantee of UDS
To intuitively interpret the terms that appear in this bound: term (b) corresponds to the standard policy improvement that results from running the offline RL algorithm; term (c) corresponds to the sampling error incurred by performing offline RL on the dynamics induced by a finite dataset; term (d) corresponds to the sampling error due to a stochastic reward function; and term (a) corresponds to the bias incurred by labeling the unlabeled transitions in the data with a reward of 0.
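Writing the exact expressions from the proposition as placeholder \epsilon and \delta terms, the bound therefore has the schematic shape
\[
J(\pi^*_i) \;\ge\; J(\pi^{\mathrm{eff}}_{\beta,i})
\;-\; \underbrace{\epsilon_{\mathrm{bias}}}_{\text{(a)}}
\;+\; \underbrace{\delta_{\mathrm{improve}}}_{\text{(b)}}
\;-\; \underbrace{\epsilon_{\mathrm{dyn}}}_{\text{(c)}}
\;-\; \underbrace{\epsilon_{\mathrm{rew}}}_{\text{(d)}},
\]
where \pi^{\mathrm{eff}}_{\beta,i} is the effective behavior policy of task i's dataset after relabeling, and the placeholders stand in for the precise reward-bias, policy-improvement, and sampling-error expressions in the proposition.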
How does UDS compare to No Sharing? In the setting where no data is shared across tasks, we attain the guarantee shown in the first section for standard CQL. Comparing Proposition D.1 to that guarantee, we note that under some circumstances UDS yields a tighter bound than No Sharing. For instance, consider a scenario where tasks have a long horizon H = 1 / (1 - \gamma) and the effective dataset size of task i after sharing is H^2 times its original dataset size. In this case, the dynamics sampling error term (c) contains one fewer factor of H when UDS is utilized than when it is not. Since the dynamics sampling error grows quadratically in the horizon, whereas the other terms grow only linearly, reducing this term by increasing the sample size in its denominator can lead to a stronger guarantee for UDS than for No Sharing. This reasoning does not even account for term (d), which can be trivially upper-bounded by the corresponding term for No Sharing, even though UDS reduces this term as well.
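To make the horizon argument concrete, suppose for illustration that term (c) follows the usual concentration-style dependence on dataset size, i.e., it scales as H^2 / \sqrt{|\mathcal{D}_i|}. Then
\[
\text{No Sharing:}\;\; \epsilon_{\mathrm{dyn}} \,\propto\, \frac{H^2}{\sqrt{|\mathcal{D}_i|}},
\qquad
\text{UDS:}\;\; \epsilon_{\mathrm{dyn}} \,\propto\, \frac{H^2}{\sqrt{H^2\,|\mathcal{D}_i|}} \;=\; \frac{H}{\sqrt{|\mathcal{D}_i|}},
\]
so sharing removes one factor of H from term (c), whereas the remaining terms grow only linearly in H.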
Abstract Form of the CUDS Objective
Safe Policy Improvement Guarantee of CUDS
Comparing CUDS and UDS. While UDS and CUDS both attain similar-looking guarantees, there are major differences. First, since the abstract form of CUDS also optimizes over the behavior policy, similar to CDS (Yu et al. 2021a, Equation 14), we can show that it reduces the distributional shift term appearing in the numerator of term (c). In addition, since CUDS adds unlabeled data, it also increases the denominator of term (c), further reducing the sampling error. For UDS, in contrast, the distributional shift between the learned policy and the effective behavior policy may increase, whereas CUDS provably reduces this quantity. Finally, we note that the CUDS policy improves over the optimized behavior policy learned by the abstract CUDS model, which we expect to perform better than the effective behavior policy obtained by naively adding all the shared data, as in UDS. Therefore, CUDS reduces distributional shift and improves the baseline policy over which the safe-policy improvement bound guarantees improvement, while retaining the benefit of reduced sampling error from an increased dataset size, as in UDS.
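To illustrate how reweighting the relabeled data toward the target task can reduce distributional shift in practice, below is a minimal sketch of a CDS-style sharing rule applied on top of UDS's 0-reward relabeling. The percentile-thresholding rule and all names here are illustrative assumptions, not the exact implementation from the paper.

```python
import numpy as np

def cuds_share_weights(q_task_data, q_relabeled, percentile=90):
    """Sketch: CDS-style weights for transitions relabeled with reward 0 (UDS).

    q_task_data: conservative Q-values on the target task's own transitions.
    q_relabeled: conservative Q-values, for the target task, of transitions
                 shared from other tasks after 0-reward relabeling.

    A relabeled transition is kept only if its Q-value clears the given
    percentile of the target task's own Q-values, biasing the effective
    behavior policy toward data that is relevant to the target task.
    """
    threshold = np.percentile(q_task_data, percentile)
    return (q_relabeled >= threshold).astype(np.float32)
```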
For the proofs of each proposition and the accompanying arguments, please see here.
Ablation on data with varying success rates. We now present the performance of UDS in settings where the actual success rate of the data relabeled from other tasks ranges from 5% to 90%. As shown in the table below, UDS still attains competitive results when the success rate of the relabeled data is high (50% and 90%), compared to UDS without controlling the success rate of the relabeled data, while it struggles when the success rate of the relabeled data is low (5%). Therefore, UDS is able to obtain good performance even in settings where the reward bias on relabeled data is high, which suggests that our theoretical results generally hold in practice.
Results of CUDS and UDS on multi-task problems with continuous rewards. Next, we show the results of CUDS and UDS on the multi-task walker environment introduced in prior work (CDS), which has continuous rewards, unlike the settings evaluated in our main text, where the rewards are binary. In this setting, instead of relabeling rewards from other tasks as 0, we relabel the other tasks' data with the minimum reward observed in the offline dataset.
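Below is a minimal sketch of this relabeling step, covering both the binary-reward setting (relabel shared data with 0) and the continuous-reward setting (relabel with the minimum reward observed in the labeled offline dataset); the function and argument names are assumptions for illustration.

```python
import numpy as np

def uds_relabel_rewards(num_shared, labeled_rewards=None, binary_reward=True):
    """Sketch of UDS reward relabeling for transitions shared from other tasks.

    binary_reward=True:  every shared transition receives reward 0.
    binary_reward=False: every shared transition receives the minimum reward
                         observed in the labeled offline dataset.
    """
    fill_value = 0.0 if binary_reward else float(np.min(labeled_rewards))
    return np.full(num_shared, fill_value, dtype=np.float32)
```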
As shown in the table below, CUDS and UDS perform comparably to CDS and Sharing All, both of which have access to the ground-truth rewards. CUDS also outperforms No Sharing. These results suggest that CUDS and UDS generalize to dense-reward settings and that the binary-reward assumption can be removed for both methods.