## Objective Mismatch in Model-based Reinforcement Learning

Anonymous Authors During Review


## Abstract:

Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficient learning of control in continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, leaving the general framework virtually unchanged since its conception. In this paper, we identify a fundamental issue of the standard MBRL framework -- what we call the *objective mismatch issue*. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model w.r.t. the likelihood of the one-step ahead prediction, and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate, and, vice versa, globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of the one-step ahead prediction is not always correlated with downstream control performance. This observation highlights a critical flaw in the current MBRL framework which will require further research to be fully understood and addressed. To this end, we conclude with a discussion of potential directions of future research for addressing this issue.

# New Experiments

## Addressing Mismatch

Tweaking dynamics model training can partially mitigate the problem of objective mismatch. While keeping the NLL minimization method used in MBRL, the trainer can prioritize state transitions associated with an expert trajectory over points farther from expert transitions, so that the dynamics model uses its capacity to learn task-relevant dynamics more quickly. We explore the effects of re-weighting network training by a measure of Euclidean distance in the state-action space of cartpole.

If the state space of an environment is considered as the concatenation of the states and actions, we can quantify the distance between any two tuples {(s_i, a_i, s_i'), i=1,2} as d = ||[s_1, a_1] - [s_2, a_2]||. With this distance, we re-weight the loss, l(y), so that points further from an expert policy receive lower weight: points on the expert trajectory get a weight w(y)=1, and points at the edge of the grid dataset used in the paper get a weight w(y)=0.
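The distance and linear weighting above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the choice of a linear falloff between w=1 at d=0 and w=0 at d=d_max are assumptions for clarity.

```python
import numpy as np

def transition_distance(s1, a1, s2, a2):
    """Euclidean distance between two transitions in the joint
    state-action space: d = ||[s1, a1] - [s2, a2]||."""
    return np.linalg.norm(np.concatenate([s1, a1]) - np.concatenate([s2, a2]))

def loss_weight(d, d_max):
    """Linear re-weighting of the NLL loss: w = 1 on the expert
    trajectory (d = 0), falling to w = 0 at the edge of the grid
    dataset (d = d_max)."""
    return max(0.0, 1.0 - d / d_max)
```

During training, each transition's NLL term would simply be multiplied by its weight before averaging.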

With this notion of distance and weighting, we used the expert dataset discussed in Experiments as a distance baseline. We generated a base dataset of 25,000,000 tuples (s, a, s') by sampling uniformly across the state and action space of cartpole (d_s + d_a = 5). We catalogued this dataset by taking the minimum orthogonal distance, d*, from each of the points to the 200-element dataset from one expert trajectory that achieved a reward of 180. To create datasets ranging from near-expert to nearly uniform across the state space, we vary the distance bound, epsilon, and the number of training points, S.
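Computing d* amounts to a nearest-neighbor distance from each random state-action pair to the expert set. A minimal sketch under assumed array shapes (both inputs flattened to [N, d_s + d_a]); at the scale of 25M points this would be done in chunks or with a spatial index rather than a full pairwise matrix.

```python
import numpy as np

def min_expert_distance(random_sa, expert_sa):
    """For each random (s, a) pair, the minimum Euclidean distance d*
    to any expert transition. Shapes: [N, d] and [M, d] -> [N]."""
    # Pairwise differences via broadcasting: [N, M, d]
    diffs = random_sa[:, None, :] - expert_sa[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # [N, M]
    return dists.min(axis=1)                 # nearest expert point per sample
```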

For each sampling bound, epsilon, we sampled 5 different datasets such that d* < epsilon from our distance-tabulated random dataset, and for each dataset defined by (S, epsilon) we trained 5 different P models, giving 25 samples evaluated with the PETS algorithm on the cartpole task for each point in the heatmap shown below. This simple linear re-weighting of the neural network loss improved the sample efficiency of learning the cartpole task. Developing an iterative method to re-weight samples during online training has the potential to further improve the sample efficiency of MBRL baselines.
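Constructing one (S, epsilon) dataset from the distance-tabulated pool can be sketched as below; the function name and argument layout are illustrative, not from the paper's code.

```python
import numpy as np

def sample_dataset(tuples, d_star, S, epsilon, rng):
    """Draw S transitions uniformly from those whose minimum expert
    distance d* is below the sampling bound epsilon."""
    candidates = np.where(d_star < epsilon)[0]
    idx = rng.choice(candidates, size=S, replace=False)
    return tuples[idx]
```

Repeating this draw 5 times per bound, and training 5 models per draw, yields the 25 PETS evaluations behind each heatmap cell.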

### Model Training on Different Datasets

Mean reward of PETS trials on a log-grid of dynamics model training sets with number of points S ∈ [10, 10000] and sampling expert-distance bounds ε ∈ [.28, 15.66]. The performance of PETS declines when the dynamics model is trained on points too near to the expert dataset because the model lacks robustness when running online with the stochastic MPC.

### Changes with Model Re-Weighting

Mean reward of PETS trials with training re-weighting. The re-weighting reaches moderate performance with substantially fewer datapoints (note the region with dataset size < 1000), but suffers from increased variance at larger set sizes. This change to address mismatch is an exciting direction for future work.

### Key Areas in Dataset Distribution Plots

In the above two plots, there are two key areas to note:

- The area at the bottom of both plots, where the distance from the expert trajectory is very low. Performance increases as the sampling bound grows from 0 to 1, because the additional data allows the models to be more robust under the stochastic MPC.
- The left half of both plots, where the dataset size is still low. The re-weighting method we implement performs better on small amounts of data. While this method still needs refinement, it shows the potential for model-based RL to gain increased sample efficiency.

## Effect of Dataset Distribution when Learning

### Additional Off-Expert Data, 1 Task

Learning speed can be slowed by many aspects of the dataset distribution, such as the addition of irrelevant transitions. When extra transitions from a specific area of the state space are included in the training set, the dynamics model spends capacity on those transitions. The NLL of the model is biased downward as it fits this data, but learning slows as new, more relevant transitions are added to the training set.

Running cartpole random data collection with a short horizon of 10 steps (while forcing the initial babbling state to always be 0), for 20, 200, 400, and 2000 babbling rollouts (summing to 200, 2000, 4000, and 20000 transitions in the final dataset), shows some regression in learning speed for runs with more useless data from the motor babbling. This data highlights the importance of a careful exploration-exploitation tradeoff, or of changing how models are trained to be selective with data.
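The dataset sizes above follow directly from rollouts times the fixed horizon; a quick sketch of the arithmetic:

```python
# Each babbling run uses a fixed horizon of 10 steps, so the final
# dataset size is simply num_rollouts * horizon.
HORIZON = 10

def dataset_size(num_rollouts, horizon=HORIZON):
    return num_rollouts * horizon

sizes = [dataset_size(n) for n in (20, 200, 400, 2000)]
print(sizes)  # [200, 2000, 4000, 20000]
```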

### Model Generalization to Multiple Tasks

In this section, we compare the performance of a model trained on data for the standard cartpole task (x position goal at 0) to policies attempting to move the cart to different positions along the x-axis. On the left is a learning curve of PETS with a PE model using the CEM optimizer. Even though performance levels out, the NLL continues to decrease as the dynamics models accrue more data. With more complicated systems, such as halfcheetah, the reward of different tasks versus the global likelihood of the model would likely be more interesting (especially with incremental model training); we will investigate this in future work. Below, we show that the dynamics model generalizes well to tasks close to zero (both positive and negative positions), but performance drops off in areas the training set does not cover as well.

Below the learning curves, we include snapshots of the distributions of training data used for these models at different trials, showing how coverage relates to reward in cartpole. It is worth investigating how many points can be removed from the training set while maintaining peak performance on each task.

Task performance with goal positions less than zero.

Task performance with goal positions greater than zero.

After 1 Trial.

After 5 Trials.

After 20 Trials.

# Exploring Model Likelihood vs Episode Reward

### Half Cheetah Log Likelihood vs Reward

Reward Achieved with Dynamics Model, Average 10 Episodes

This plot is an improved version of the original analysis (below) that accounts for episodic uncertainty in MBRL with an MPC planner. We loaded all of the models whose NLL we had evaluated on various datasets and ran 10 episodes each with a CEM-based MPC. Given the large number of models evaluated in the original analysis, averaging over episodes does not change the trend: there is only a loose relationship between episode reward and model likelihood, so more specific metrics of model training are needed to guarantee performance.
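The looseness of the relationship can be summarized with a simple correlation between each model's validation NLL and its mean episode reward. A minimal sketch, assuming the per-model values have already been collected into two arrays:

```python
import numpy as np

def nll_reward_correlation(nlls, rewards):
    """Pearson correlation between per-model NLL and the mean episode
    reward each model achieves under the MPC planner. Values near 0
    indicate the objective mismatch: likelihood barely predicts reward."""
    nlls = np.asarray(nlls, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return np.corrcoef(nlls, rewards)[0, 1]
```

A rank correlation (e.g. Spearman) would be a natural alternative when the reward scale is heavy-tailed, as in halfcheetah.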

Original Data of Reward (1 Episode) vs NLL

### Cartpole Log Likelihood vs Reward

New Cartpole Data (Average of 10 Episodes)

Original Cartpole Data